grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/

[Ingester] Intermittent KubePersistentVolumeFillingUp alerts #2129

LasseHels closed this issue 1 year ago

LasseHels commented 1 year ago

Describe the bug: Our ingesters' PersistentVolumeClaims intermittently fill up, seemingly at random.

To Reproduce: We have not been able to consistently reproduce the issue.

Expected behavior: We would expect the used space on the PersistentVolumeClaim to remain fairly consistent and not spike.

Environment:

Additional Context: We run a total of nine ingesters, three in each of three zones.

For the past couple of months, the used space of an ingester's PVC will occasionally start increasing gradually. The increase continues until usage approaches 100%, at which point we have to intervene manually.

Here are a couple of screenshots of the PVC space usage when it happens:

  1. Between 2023-02-21T00:25:00Z and 2023-02-21T04:00:00Z, the PVC of one of our ingesters in zone A went from ~3GB of usage to ~19GB: Screenshot 2023-02-23 at 09 23 59
  2. Between 2023-02-21T23:50:00Z and 2023-02-22T02:50:00Z, the PVC of one of our ingesters in zone C went from ~3GB of usage to ~18.4GB: Screenshot 2023-02-23 at 09 26 26
  3. Between 2023-02-12T20:20:00Z and 2023-02-13T15:28:00Z, the PVC of one of our ingesters in zone C went from ~2GB of usage to ~18.9GB: Screenshot 2023-02-23 at 09 30 43

We have observed this issue across different Tempo versions, namely v1.5.0 and v2.0.0 (which we recently upgraded to).

Initially, we figured that it was an issue with the flushing mechanism of the ingester, but preliminary log analysis has been unable to confirm this theory.

Using our first example from above (the ingester from zone A): Error logs (excluding TRACE_TOO_LARGE) from the period before and after the increase in space usage: Screenshot 2023-02-23 at 09 37 55

As we can see, no error logs were emitted in the period leading up to the disk being full. The first error log emitted is at 2023-02-21T04:02:34, which is when there was no space left: "error completing wal block with local backend: error creating block: write /var/tempo/wal/blocks/single-tenant/f291dded-9dea-4dfa-985b-5984c5421f49/data.parquet: no space left on device"

For the record, I am also attaching the graph of all logs (excluding TRACE_TOO_LARGE): Screenshot 2023-02-23 at 09 40 22

This graph shows two spikes: one at ~2023-02-21T04:00:00Z, and a second one at ~2023-02-21T06:30:00Z. The first spike corresponds to the no space left on device error messages; the second consists of logs from the manual intervention.

We would much appreciate any help with this issue, as we are somewhat stuck at the moment and unsure of how to proceed.

Please let me know of any additional information that may be of use in debugging the issue. I will make sure it is swiftly provided.

As a final note: I have created this issue as a bug, but I appreciate that the problem may just as well be a misconfiguration on our side as a bug in Tempo itself. Our team is very happy to use Tempo in general, and this is a nut that we cannot seem to crack on our own.

Thanks.

joe-elliott commented 1 year ago

Thanks for filing the issue. Not having any error logs is quite frustrating as it would be the most likely source of good information here. Some things we can do:

Is the ingester unexpectedly restarting, or even gracefully restarting, before this issue occurs? I have seen something similar once, but it occurred after an ingester OOMed and then had issues reading its WAL on restart.

We have also seen issues in AKS where mounted drives will swap to read only. Perhaps try to manually delete a file from the drive when this issue is occurring? Tempo should log if it fails to remove a file, but I think it's still a good test.

LasseHels commented 1 year ago

Hey @joe-elliott, thanks for your quick reply.

I've looked into the metrics you mentioned using the example of the ingester from zone A above. As a reminder, this ingester's PVC filled up from 2023-02-21T00:25:00Z to 2023-02-21T04:00:00Z.

Check tempo_ingester_blocks_flushed_total to see if it is flushing blocks at the same rate.

It definitely looks like the rate of flushes slowly degrades while the PVC is filling up: Screenshot 2023-02-24 at 10 13 25

Check tempo_ingester_failed_flushes_total to see if we are metric'ing failed flushes but for some reason not logging info.

From what I can see, the only failed flushes emitted by the tempo_ingester_failed_flushes_total metric are the ones around 04:00 AM when the disk was already full: Screenshot 2023-02-24 at 10 15 09

As expected, this also corresponds exactly with the error logs we saw above.

Check tempodb_backend_request_duration_seconds_count to see if there's any network interruption between storage and the ingester.

Request duration looks pretty stable: Screenshot 2023-02-24 at 10 17 07

However, looking at the tail end of the yellow data points, this chart seems to also indicate that the rate of flushes slowly degrades.

Dump the contents of the /var/tempo/ folder (or wherever you are storing ingester data). Let's see if the wal files or the complete blocks are building up (or both).

I'll take a look at this the next time the issue occurs.

Is the ingester unexpectedly restarting? or even gracefully restarting before this issue occurs?

Using the kube_pod_container_status_restarts_total metric as a reference, there do not appear to be any restarts in the two days leading up to the issue: Screenshot 2023-02-24 at 10 22 04

We have also seen issues in AKS where mounted drives will swap to read only. Perhaps try to manually delete a file from the drive when this issue is occurring?

This sounds interesting, I'll try this out as well the next time the issue occurs.

I looked around the Tempo dashboards again, and I did find one anomaly: Memory (workingset) of the ingester. Check this out: Screenshot 2023-02-24 at 10 24 07

While CPU usage looks fairly steady, the used memory of the ingester definitely seems to spike right around 2023-02-21T00:25:00Z when the PVC started filling up.

At 2023-02-21T00:20:00Z, the memory usage of the ingester was 1.75GB, which increased to ~3.5GB in the next ten minutes and stayed there until ~2023-02-21T04:00:00Z when manual intervention took place. I'm not sure if this is a cause or a symptom of the issue.

liam-howe-maersk commented 1 year ago

Hi @joe-elliott, we caught a live instance of this issue occurring, and I was able to dump the contents of /var/tempo. Here is the result; does this tell us anything?

contents of /var/tempo ``` . ./lost+found ./wal ./wal/blocks ./wal/blocks/single-tenant ./wal/blocks/single-tenant/cc768785-60b5-4579-a337-0f7d80d5ae0d ./wal/blocks/single-tenant/cc768785-60b5-4579-a337-0f7d80d5ae0d/bloom-0 ./wal/blocks/single-tenant/cc768785-60b5-4579-a337-0f7d80d5ae0d/data.parquet ./wal/blocks/single-tenant/cc768785-60b5-4579-a337-0f7d80d5ae0d/meta.json ./wal/blocks/single-tenant/4250beda-c00a-44af-9e08-6f31b04c2eb4 ./wal/blocks/single-tenant/4250beda-c00a-44af-9e08-6f31b04c2eb4/bloom-0 ./wal/blocks/single-tenant/4250beda-c00a-44af-9e08-6f31b04c2eb4/data.parquet ./wal/blocks/single-tenant/4250beda-c00a-44af-9e08-6f31b04c2eb4/meta.json ./wal/blocks/single-tenant/93c4b634-71d8-4352-b232-d4d8a23add36 ./wal/blocks/single-tenant/93c4b634-71d8-4352-b232-d4d8a23add36/bloom-0 ./wal/blocks/single-tenant/93c4b634-71d8-4352-b232-d4d8a23add36/data.parquet ./wal/blocks/single-tenant/93c4b634-71d8-4352-b232-d4d8a23add36/meta.json ./wal/blocks/single-tenant/faa68155-62c2-4a12-84ac-3398b02ab442 ./wal/blocks/single-tenant/faa68155-62c2-4a12-84ac-3398b02ab442/bloom-0 ./wal/blocks/single-tenant/faa68155-62c2-4a12-84ac-3398b02ab442/data.parquet ./wal/blocks/single-tenant/faa68155-62c2-4a12-84ac-3398b02ab442/meta.json ./wal/blocks/single-tenant/67a22a1b-1556-47cf-b6ac-6709e12df095 ./wal/blocks/single-tenant/67a22a1b-1556-47cf-b6ac-6709e12df095/bloom-0 ./wal/blocks/single-tenant/67a22a1b-1556-47cf-b6ac-6709e12df095/flushed ./wal/blocks/single-tenant/67a22a1b-1556-47cf-b6ac-6709e12df095/data.parquet ./wal/blocks/single-tenant/67a22a1b-1556-47cf-b6ac-6709e12df095/meta.json ./wal/blocks/single-tenant/f1ca99f3-f4a3-4ca5-b4d4-74f56086f510 ./wal/blocks/single-tenant/f1ca99f3-f4a3-4ca5-b4d4-74f56086f510/bloom-0 ./wal/blocks/single-tenant/f1ca99f3-f4a3-4ca5-b4d4-74f56086f510/data.parquet ./wal/blocks/single-tenant/f1ca99f3-f4a3-4ca5-b4d4-74f56086f510/meta.json ./wal/blocks/single-tenant/7b1a77ad-22a1-4ba8-aea2-0901ec3b16c1 ./wal/blocks/single-tenant/7b1a77ad-22a1-4ba8-aea2-0901ec3b16c1/bloom-0 ./wal/blocks/single-tenant/7b1a77ad-22a1-4ba8-aea2-0901ec3b16c1/data.parquet ./wal/blocks/single-tenant/7b1a77ad-22a1-4ba8-aea2-0901ec3b16c1/meta.json ./wal/blocks/single-tenant/20d19039-e5e9-40b0-ac70-0899131a6650 ./wal/blocks/single-tenant/20d19039-e5e9-40b0-ac70-0899131a6650/bloom-0 ./wal/blocks/single-tenant/20d19039-e5e9-40b0-ac70-0899131a6650/data.parquet ./wal/blocks/single-tenant/20d19039-e5e9-40b0-ac70-0899131a6650/meta.json ./wal/blocks/single-tenant/c89475a4-3e44-4101-8eda-a83151e9dfae ./wal/blocks/single-tenant/c89475a4-3e44-4101-8eda-a83151e9dfae/bloom-0 ./wal/blocks/single-tenant/c89475a4-3e44-4101-8eda-a83151e9dfae/data.parquet ./wal/blocks/single-tenant/c89475a4-3e44-4101-8eda-a83151e9dfae/meta.json ./wal/blocks/single-tenant/89974822-2a94-4fc9-8718-e25e2a0953ee ./wal/blocks/single-tenant/89974822-2a94-4fc9-8718-e25e2a0953ee/bloom-0 ./wal/blocks/single-tenant/89974822-2a94-4fc9-8718-e25e2a0953ee/data.parquet ./wal/blocks/single-tenant/89974822-2a94-4fc9-8718-e25e2a0953ee/meta.json ./wal/blocks/single-tenant/f96d85b3-8d20-48ed-bd4e-07b47ed86199 ./wal/blocks/single-tenant/f96d85b3-8d20-48ed-bd4e-07b47ed86199/bloom-0 ./wal/blocks/single-tenant/f96d85b3-8d20-48ed-bd4e-07b47ed86199/data.parquet ./wal/blocks/single-tenant/f96d85b3-8d20-48ed-bd4e-07b47ed86199/meta.json ./wal/blocks/single-tenant/42e0c056-bd59-440f-afb4-ac998a9c4c07 ./wal/blocks/single-tenant/42e0c056-bd59-440f-afb4-ac998a9c4c07/bloom-0 ./wal/blocks/single-tenant/42e0c056-bd59-440f-afb4-ac998a9c4c07/data.parquet 
./wal/blocks/single-tenant/42e0c056-bd59-440f-afb4-ac998a9c4c07/meta.json ./wal/blocks/single-tenant/8372243e-288a-427c-a1ec-d1f00564ee0f ./wal/blocks/single-tenant/8372243e-288a-427c-a1ec-d1f00564ee0f/bloom-0 ./wal/blocks/single-tenant/8372243e-288a-427c-a1ec-d1f00564ee0f/data.parquet ./wal/blocks/single-tenant/8372243e-288a-427c-a1ec-d1f00564ee0f/meta.json ./wal/blocks/single-tenant/76142077-fe68-4371-8a7d-854d2e975aad ./wal/blocks/single-tenant/76142077-fe68-4371-8a7d-854d2e975aad/bloom-0 ./wal/blocks/single-tenant/76142077-fe68-4371-8a7d-854d2e975aad/data.parquet ./wal/blocks/single-tenant/76142077-fe68-4371-8a7d-854d2e975aad/meta.json ./wal/blocks/single-tenant/bd4934b3-69b2-45fc-b217-fa7cae0391df ./wal/blocks/single-tenant/bd4934b3-69b2-45fc-b217-fa7cae0391df/bloom-0 ./wal/blocks/single-tenant/bd4934b3-69b2-45fc-b217-fa7cae0391df/data.parquet ./wal/blocks/single-tenant/bd4934b3-69b2-45fc-b217-fa7cae0391df/meta.json ./wal/blocks/single-tenant/3a81f841-2e1d-4a84-9eca-dce70a7cc5c8 ./wal/blocks/single-tenant/3a81f841-2e1d-4a84-9eca-dce70a7cc5c8/bloom-0 ./wal/blocks/single-tenant/3a81f841-2e1d-4a84-9eca-dce70a7cc5c8/data.parquet ./wal/blocks/single-tenant/3a81f841-2e1d-4a84-9eca-dce70a7cc5c8/meta.json ./wal/blocks/single-tenant/768fe090-4946-4876-86a0-ad46db5b73ef ./wal/blocks/single-tenant/768fe090-4946-4876-86a0-ad46db5b73ef/bloom-0 ./wal/blocks/single-tenant/768fe090-4946-4876-86a0-ad46db5b73ef/data.parquet ./wal/blocks/single-tenant/768fe090-4946-4876-86a0-ad46db5b73ef/meta.json ./wal/blocks/single-tenant/7bc65ecb-8398-4794-9bea-5a0859fe244c ./wal/blocks/single-tenant/7bc65ecb-8398-4794-9bea-5a0859fe244c/bloom-0 ./wal/blocks/single-tenant/7bc65ecb-8398-4794-9bea-5a0859fe244c/data.parquet ./wal/blocks/single-tenant/7bc65ecb-8398-4794-9bea-5a0859fe244c/meta.json ./wal/blocks/single-tenant/211f66f6-e433-42c4-ac69-bf344b0cf876 ./wal/blocks/single-tenant/211f66f6-e433-42c4-ac69-bf344b0cf876/bloom-0 ./wal/blocks/single-tenant/211f66f6-e433-42c4-ac69-bf344b0cf876/data.parquet ./wal/blocks/single-tenant/211f66f6-e433-42c4-ac69-bf344b0cf876/meta.json ./wal/blocks/single-tenant/742b0f8c-d209-4763-8c2e-5249f3a12b3a ./wal/blocks/single-tenant/742b0f8c-d209-4763-8c2e-5249f3a12b3a/bloom-0 ./wal/blocks/single-tenant/742b0f8c-d209-4763-8c2e-5249f3a12b3a/data.parquet ./wal/blocks/single-tenant/742b0f8c-d209-4763-8c2e-5249f3a12b3a/meta.json ./wal/blocks/single-tenant/64bb59b6-de80-4f52-af7f-98171b898f31 ./wal/blocks/single-tenant/64bb59b6-de80-4f52-af7f-98171b898f31/bloom-0 ./wal/blocks/single-tenant/64bb59b6-de80-4f52-af7f-98171b898f31/flushed ./wal/blocks/single-tenant/64bb59b6-de80-4f52-af7f-98171b898f31/data.parquet ./wal/blocks/single-tenant/64bb59b6-de80-4f52-af7f-98171b898f31/meta.json ./wal/blocks/single-tenant/777ab23c-d7a8-45bd-8a2f-5254541a173e ./wal/blocks/single-tenant/777ab23c-d7a8-45bd-8a2f-5254541a173e/bloom-0 ./wal/blocks/single-tenant/777ab23c-d7a8-45bd-8a2f-5254541a173e/data.parquet ./wal/blocks/single-tenant/777ab23c-d7a8-45bd-8a2f-5254541a173e/meta.json ./wal/blocks/single-tenant/fe7cb40f-73cb-4bd7-b2cb-80c1c6def61f ./wal/blocks/single-tenant/fe7cb40f-73cb-4bd7-b2cb-80c1c6def61f/bloom-0 ./wal/blocks/single-tenant/fe7cb40f-73cb-4bd7-b2cb-80c1c6def61f/data.parquet ./wal/blocks/single-tenant/fe7cb40f-73cb-4bd7-b2cb-80c1c6def61f/meta.json ./wal/blocks/single-tenant/46f8f48e-7836-452f-90b7-47add144b00c ./wal/blocks/single-tenant/46f8f48e-7836-452f-90b7-47add144b00c/bloom-0 ./wal/blocks/single-tenant/46f8f48e-7836-452f-90b7-47add144b00c/data.parquet 
./wal/blocks/single-tenant/46f8f48e-7836-452f-90b7-47add144b00c/meta.json ./wal/blocks/single-tenant/0c6b2aee-50dc-4d23-a7bf-7a6203d598aa ./wal/blocks/single-tenant/0c6b2aee-50dc-4d23-a7bf-7a6203d598aa/bloom-0 ./wal/blocks/single-tenant/0c6b2aee-50dc-4d23-a7bf-7a6203d598aa/data.parquet ./wal/blocks/single-tenant/0c6b2aee-50dc-4d23-a7bf-7a6203d598aa/meta.json ./wal/blocks/single-tenant/5f0b46ac-6345-4b86-af63-c2ab6a8157d8 ./wal/blocks/single-tenant/5f0b46ac-6345-4b86-af63-c2ab6a8157d8/bloom-0 ./wal/blocks/single-tenant/5f0b46ac-6345-4b86-af63-c2ab6a8157d8/data.parquet ./wal/blocks/single-tenant/5f0b46ac-6345-4b86-af63-c2ab6a8157d8/meta.json ./wal/blocks/single-tenant/738de45d-afb2-4ab0-9968-382e671120cd ./wal/blocks/single-tenant/738de45d-afb2-4ab0-9968-382e671120cd/bloom-0 ./wal/blocks/single-tenant/738de45d-afb2-4ab0-9968-382e671120cd/data.parquet ./wal/blocks/single-tenant/738de45d-afb2-4ab0-9968-382e671120cd/meta.json ./wal/blocks/single-tenant/20a22584-7690-433f-96fc-1823bd63095f ./wal/blocks/single-tenant/20a22584-7690-433f-96fc-1823bd63095f/bloom-0 ./wal/blocks/single-tenant/20a22584-7690-433f-96fc-1823bd63095f/data.parquet ./wal/blocks/single-tenant/20a22584-7690-433f-96fc-1823bd63095f/meta.json ./wal/blocks/single-tenant/ef619183-a3c0-4626-b184-0d49f2e13d41 ./wal/blocks/single-tenant/ef619183-a3c0-4626-b184-0d49f2e13d41/bloom-0 ./wal/blocks/single-tenant/ef619183-a3c0-4626-b184-0d49f2e13d41/data.parquet ./wal/blocks/single-tenant/ef619183-a3c0-4626-b184-0d49f2e13d41/meta.json ./wal/blocks/single-tenant/561cbe9c-d091-4a29-9467-afe6caa4e445 ./wal/blocks/single-tenant/561cbe9c-d091-4a29-9467-afe6caa4e445/bloom-0 ./wal/blocks/single-tenant/561cbe9c-d091-4a29-9467-afe6caa4e445/data.parquet ./wal/blocks/single-tenant/561cbe9c-d091-4a29-9467-afe6caa4e445/meta.json ./wal/blocks/single-tenant/909a4250-e879-4400-ad82-844a1a65db8a ./wal/blocks/single-tenant/909a4250-e879-4400-ad82-844a1a65db8a/data.parquet ./wal/blocks/single-tenant/db097d01-4c85-4311-a8f7-43546f2bf56e ./wal/blocks/single-tenant/db097d01-4c85-4311-a8f7-43546f2bf56e/bloom-0 ./wal/blocks/single-tenant/db097d01-4c85-4311-a8f7-43546f2bf56e/data.parquet ./wal/blocks/single-tenant/db097d01-4c85-4311-a8f7-43546f2bf56e/meta.json ./wal/blocks/single-tenant/38279e09-07fa-44ac-87e5-51cc9126c5a3 ./wal/blocks/single-tenant/38279e09-07fa-44ac-87e5-51cc9126c5a3/bloom-0 ./wal/blocks/single-tenant/38279e09-07fa-44ac-87e5-51cc9126c5a3/data.parquet ./wal/blocks/single-tenant/38279e09-07fa-44ac-87e5-51cc9126c5a3/meta.json ./wal/blocks/single-tenant/2a506de2-7d8e-428a-96a8-1c5173e18a91 ./wal/blocks/single-tenant/2a506de2-7d8e-428a-96a8-1c5173e18a91/bloom-0 ./wal/blocks/single-tenant/2a506de2-7d8e-428a-96a8-1c5173e18a91/data.parquet ./wal/blocks/single-tenant/2a506de2-7d8e-428a-96a8-1c5173e18a91/meta.json ./wal/blocks/single-tenant/42159a22-1cd0-4784-b3b8-6980f0b6b520 ./wal/blocks/single-tenant/42159a22-1cd0-4784-b3b8-6980f0b6b520/bloom-0 ./wal/blocks/single-tenant/42159a22-1cd0-4784-b3b8-6980f0b6b520/data.parquet ./wal/blocks/single-tenant/42159a22-1cd0-4784-b3b8-6980f0b6b520/meta.json ./wal/blocks/single-tenant/309b31fe-52e7-4f38-ae0d-38b56131d06d ./wal/blocks/single-tenant/309b31fe-52e7-4f38-ae0d-38b56131d06d/bloom-0 ./wal/blocks/single-tenant/309b31fe-52e7-4f38-ae0d-38b56131d06d/data.parquet ./wal/blocks/single-tenant/309b31fe-52e7-4f38-ae0d-38b56131d06d/meta.json ./wal/blocks/single-tenant/bcf4165e-897a-4e01-afcf-7b8274bcb1b5 ./wal/blocks/single-tenant/bcf4165e-897a-4e01-afcf-7b8274bcb1b5/bloom-0 
./wal/blocks/single-tenant/bcf4165e-897a-4e01-afcf-7b8274bcb1b5/data.parquet ./wal/blocks/single-tenant/bcf4165e-897a-4e01-afcf-7b8274bcb1b5/meta.json ./wal/blocks/single-tenant/7b4bd45a-1821-419f-bf2b-70db4832acf5 ./wal/blocks/single-tenant/7b4bd45a-1821-419f-bf2b-70db4832acf5/bloom-0 ./wal/blocks/single-tenant/7b4bd45a-1821-419f-bf2b-70db4832acf5/data.parquet ./wal/blocks/single-tenant/7b4bd45a-1821-419f-bf2b-70db4832acf5/meta.json ./wal/9ffeacf1-8a26-4d10-8b7a-3bdec87c9cb3+single-tenant+vParquet ./wal/9ffeacf1-8a26-4d10-8b7a-3bdec87c9cb3+single-tenant+vParquet/0000000004 ./wal/9ffeacf1-8a26-4d10-8b7a-3bdec87c9cb3+single-tenant+vParquet/0000000005 ./wal/9ffeacf1-8a26-4d10-8b7a-3bdec87c9cb3+single-tenant+vParquet/0000000002 ./wal/9ffeacf1-8a26-4d10-8b7a-3bdec87c9cb3+single-tenant+vParquet/meta.json ./wal/9ffeacf1-8a26-4d10-8b7a-3bdec87c9cb3+single-tenant+vParquet/0000000001 ./wal/9ffeacf1-8a26-4d10-8b7a-3bdec87c9cb3+single-tenant+vParquet/0000000003 ```

I also tested whether the mounted drive was read-only, but I was able to delete one of the meta.json files in the blocks directory.

joe-elliott commented 1 year ago

Based on settings, that might be a "normal" number of complete blocks, or it might be exceptionally high. Can you check an ingester that is not showing the issue?

liam-howe-maersk commented 1 year ago

I have checked the same ingester that had its volume filling up earlier. After we deleted and recreated the volume, the ingester is currently in a healthy state.

contents of /var/tempo in previously unhealthy ingester ``` . ./wal ./wal/393743b9-157d-46da-b381-5eb88a99446a+single-tenant+vParquet ./wal/393743b9-157d-46da-b381-5eb88a99446a+single-tenant+vParquet/0000000012 ./wal/393743b9-157d-46da-b381-5eb88a99446a+single-tenant+vParquet/0000000002 ./wal/393743b9-157d-46da-b381-5eb88a99446a+single-tenant+vParquet/0000000016 ./wal/393743b9-157d-46da-b381-5eb88a99446a+single-tenant+vParquet/0000000001 ./wal/393743b9-157d-46da-b381-5eb88a99446a+single-tenant+vParquet/0000000015 ./wal/393743b9-157d-46da-b381-5eb88a99446a+single-tenant+vParquet/0000000004 ./wal/393743b9-157d-46da-b381-5eb88a99446a+single-tenant+vParquet/0000000011 ./wal/393743b9-157d-46da-b381-5eb88a99446a+single-tenant+vParquet/0000000014 ./wal/393743b9-157d-46da-b381-5eb88a99446a+single-tenant+vParquet/0000000008 ./wal/393743b9-157d-46da-b381-5eb88a99446a+single-tenant+vParquet/0000000003 ./wal/393743b9-157d-46da-b381-5eb88a99446a+single-tenant+vParquet/0000000013 ./wal/393743b9-157d-46da-b381-5eb88a99446a+single-tenant+vParquet/0000000010 ./wal/393743b9-157d-46da-b381-5eb88a99446a+single-tenant+vParquet/0000000007 ./wal/393743b9-157d-46da-b381-5eb88a99446a+single-tenant+vParquet/0000000017 ./wal/393743b9-157d-46da-b381-5eb88a99446a+single-tenant+vParquet/meta.json ./wal/393743b9-157d-46da-b381-5eb88a99446a+single-tenant+vParquet/0000000005 ./wal/393743b9-157d-46da-b381-5eb88a99446a+single-tenant+vParquet/0000000009 ./wal/393743b9-157d-46da-b381-5eb88a99446a+single-tenant+vParquet/0000000006 ./wal/blocks ./wal/blocks/single-tenant ./wal/blocks/single-tenant/8a5bbe68-b921-4ef6-b2a7-93d37f91ad4b ./wal/blocks/single-tenant/8a5bbe68-b921-4ef6-b2a7-93d37f91ad4b/data.parquet ./wal/blocks/single-tenant/8a5bbe68-b921-4ef6-b2a7-93d37f91ad4b/meta.json ./wal/blocks/single-tenant/8a5bbe68-b921-4ef6-b2a7-93d37f91ad4b/bloom-0 ./wal/blocks/single-tenant/8a5bbe68-b921-4ef6-b2a7-93d37f91ad4b/flushed ./wal/blocks/single-tenant/59c25787-3061-448e-a0a4-d406e0120240 ./wal/blocks/single-tenant/59c25787-3061-448e-a0a4-d406e0120240/data.parquet ./wal/blocks/single-tenant/59c25787-3061-448e-a0a4-d406e0120240/meta.json ./wal/blocks/single-tenant/59c25787-3061-448e-a0a4-d406e0120240/bloom-0 ./wal/blocks/single-tenant/59c25787-3061-448e-a0a4-d406e0120240/flushed ./wal/blocks/single-tenant/e2acd834-17d5-4769-bc03-114a1364d960 ./wal/blocks/single-tenant/e2acd834-17d5-4769-bc03-114a1364d960/data.parquet ./wal/blocks/single-tenant/e2acd834-17d5-4769-bc03-114a1364d960/meta.json ./wal/blocks/single-tenant/e2acd834-17d5-4769-bc03-114a1364d960/bloom-0 ./wal/blocks/single-tenant/e2acd834-17d5-4769-bc03-114a1364d960/flushed ./wal/blocks/single-tenant/4f4cfc59-0fea-4ffe-8189-15dca26f91a0 ./wal/blocks/single-tenant/4f4cfc59-0fea-4ffe-8189-15dca26f91a0/data.parquet ./wal/blocks/single-tenant/4f4cfc59-0fea-4ffe-8189-15dca26f91a0/meta.json ./wal/blocks/single-tenant/4f4cfc59-0fea-4ffe-8189-15dca26f91a0/bloom-0 ./wal/blocks/single-tenant/4f4cfc59-0fea-4ffe-8189-15dca26f91a0/flushed ./wal/blocks/single-tenant/e1e61f93-449b-4a48-86e8-23d24b893379 ./wal/blocks/single-tenant/e1e61f93-449b-4a48-86e8-23d24b893379/data.parquet ./wal/blocks/single-tenant/e1e61f93-449b-4a48-86e8-23d24b893379/meta.json ./wal/blocks/single-tenant/e1e61f93-449b-4a48-86e8-23d24b893379/bloom-0 ./wal/blocks/single-tenant/e1e61f93-449b-4a48-86e8-23d24b893379/flushed ./lost+found ```

And for comparison, here is a different ingester that has not experienced any problems with volumes filling up recently.

contents of /var/tempo in healthy ingester ``` . ./wal ./wal/blocks ./wal/blocks/single-tenant ./wal/blocks/single-tenant/4ff93478-ebcb-458e-baa0-55a638537958 ./wal/blocks/single-tenant/4ff93478-ebcb-458e-baa0-55a638537958/bloom-0 ./wal/blocks/single-tenant/4ff93478-ebcb-458e-baa0-55a638537958/flushed ./wal/blocks/single-tenant/4ff93478-ebcb-458e-baa0-55a638537958/meta.json ./wal/blocks/single-tenant/4ff93478-ebcb-458e-baa0-55a638537958/data.parquet ./wal/blocks/single-tenant/abfaf2d2-91eb-45ca-9eec-f5e44849ae85 ./wal/blocks/single-tenant/abfaf2d2-91eb-45ca-9eec-f5e44849ae85/bloom-0 ./wal/blocks/single-tenant/abfaf2d2-91eb-45ca-9eec-f5e44849ae85/flushed ./wal/blocks/single-tenant/abfaf2d2-91eb-45ca-9eec-f5e44849ae85/meta.json ./wal/blocks/single-tenant/abfaf2d2-91eb-45ca-9eec-f5e44849ae85/data.parquet ./wal/blocks/single-tenant/bf1faadf-fc28-4878-877b-bc43177863dc ./wal/blocks/single-tenant/bf1faadf-fc28-4878-877b-bc43177863dc/bloom-0 ./wal/blocks/single-tenant/bf1faadf-fc28-4878-877b-bc43177863dc/flushed ./wal/blocks/single-tenant/bf1faadf-fc28-4878-877b-bc43177863dc/meta.json ./wal/blocks/single-tenant/bf1faadf-fc28-4878-877b-bc43177863dc/data.parquet ./wal/blocks/single-tenant/a699d60f-b1c1-4dc5-8009-8c0f0b3a0aaa ./wal/blocks/single-tenant/a699d60f-b1c1-4dc5-8009-8c0f0b3a0aaa/bloom-0 ./wal/blocks/single-tenant/a699d60f-b1c1-4dc5-8009-8c0f0b3a0aaa/flushed ./wal/blocks/single-tenant/a699d60f-b1c1-4dc5-8009-8c0f0b3a0aaa/meta.json ./wal/blocks/single-tenant/a699d60f-b1c1-4dc5-8009-8c0f0b3a0aaa/data.parquet ./wal/blocks/single-tenant/4d658a9d-aaee-477c-bd10-9a5e94a1c1f7 ./wal/blocks/single-tenant/4d658a9d-aaee-477c-bd10-9a5e94a1c1f7/bloom-0 ./wal/blocks/single-tenant/4d658a9d-aaee-477c-bd10-9a5e94a1c1f7/flushed ./wal/blocks/single-tenant/4d658a9d-aaee-477c-bd10-9a5e94a1c1f7/meta.json ./wal/blocks/single-tenant/4d658a9d-aaee-477c-bd10-9a5e94a1c1f7/data.parquet ./wal/b89f99bb-3958-4773-b2f1-263285af194a+single-tenant+vParquet ./wal/b89f99bb-3958-4773-b2f1-263285af194a+single-tenant+vParquet/0000000001 ./wal/b89f99bb-3958-4773-b2f1-263285af194a+single-tenant+vParquet/0000000006 ./wal/b89f99bb-3958-4773-b2f1-263285af194a+single-tenant+vParquet/0000000003 ./wal/b89f99bb-3958-4773-b2f1-263285af194a+single-tenant+vParquet/0000000005 ./wal/b89f99bb-3958-4773-b2f1-263285af194a+single-tenant+vParquet/meta.json ./wal/b89f99bb-3958-4773-b2f1-263285af194a+single-tenant+vParquet/0000000007 ./wal/b89f99bb-3958-4773-b2f1-263285af194a+single-tenant+vParquet/0000000004 ./wal/b89f99bb-3958-4773-b2f1-263285af194a+single-tenant+vParquet/0000000008 ./wal/b89f99bb-3958-4773-b2f1-263285af194a+single-tenant+vParquet/0000000002 ./wal/b89f99bb-3958-4773-b2f1-263285af194a+single-tenant+vParquet/0000000009 ./lost+found ```
joe-elliott commented 1 year ago

Ok, so we know that flushes are slowing down which is causing the completed blocks on disk to pile up and then eventually fill the disk. We also can't find any relevant logs that might indicate there is some obvious error. For fun we could confirm this by checking the tempo_ingester_flush_duration_seconds_bucket histogram or reviewing the logs.

My guess is that some resource is being saturated during the time period that Tempo is struggling to offload complete blocks from the ingesters. Above you indicated there was a spike in memory usage at the time that tempo was having issues. This is another potential indicator that some resource is saturated.

Things to try/look for:

I also see heap increasing. This has me concerned that there is a bug where something in the ingester is spiraling out of control and doing a ton of unnecessary work. Things we can check:

I apologize for all the homework but you've stumbled on a tough one. I'm more than willing to work through this with you though!

LasseHels commented 1 year ago

Hey @joe-elliott.

We are still experiencing this issue, and are still trying to figure out the root cause.

I'm going to be dumping different relevant (I hope) metrics from a couple of examples we had recently:

ingester-zone-a-4 which started filling up at 2023-03-22T16:30:00Z and was full at 2023-03-02T03:24:00Z PVC filling up: ![Screenshot 2023-04-05 at 12 27 52](https://user-images.githubusercontent.com/121869045/230054884-ef488e4d-70c1-4a6b-ae36-081d82e89601.png) Flushes slowed down: ![Screenshot 2023-04-05 at 12 27 08](https://user-images.githubusercontent.com/121869045/230054732-a241a956-5c3a-4199-8cdc-4ec648fc8b1e.png) Number of goroutines looks fairly steady: ![Screenshot 2023-04-05 at 12 28 49](https://user-images.githubusercontent.com/121869045/230055097-be98f014-4b47-422c-8869-5f9d10b9711c.png) Request rate: ![Screenshot 2023-04-05 at 12 30 32](https://user-images.githubusercontent.com/121869045/230055495-d2192ba3-4e7c-421b-814b-5cf0ee81009c.png) Traces created: ![Screenshot 2023-04-05 at 12 43 47](https://user-images.githubusercontent.com/121869045/230058286-d1ddd180-e7b8-4b91-b240-91db467dd1f4.png) Flush queue length grows: ![Screenshot 2023-04-05 at 12 44 53](https://user-images.githubusercontent.com/121869045/230058505-a217f0b2-eaa6-44a8-b4a3-e5414355c011.png)
ingester-zone-a-3 which started filling up at 2023-03-27T06:00:00Z and was full at 2023-03-27T14:20:00Z PVC filling up: ![Screenshot 2023-04-05 at 12 49 31](https://user-images.githubusercontent.com/121869045/230059402-1a7ea2a7-0c68-4b58-b9d4-559c382b4175.png) Flushes slowed down: ![Screenshot 2023-04-05 at 12 50 31](https://user-images.githubusercontent.com/121869045/230059597-802dbfc5-05de-49af-9590-619d878a627f.png) There are some spikes in the number of goroutines, but otherwise looks fairly steady: ![Screenshot 2023-04-05 at 12 51 10](https://user-images.githubusercontent.com/121869045/230059725-d8aa300d-8d62-4694-b1b5-570193f6e5e6.png) Request rate: ![Screenshot 2023-04-05 at 12 52 34](https://user-images.githubusercontent.com/121869045/230060013-f223f8a0-6eac-4697-8702-20143a9ccbc2.png) Traces created: ![Screenshot 2023-04-05 at 12 53 10](https://user-images.githubusercontent.com/121869045/230060164-5ab6a7a1-fb48-4dcd-986d-f7e8e3de9789.png) Flush queue length grows: ![Screenshot 2023-04-05 at 12 53 39](https://user-images.githubusercontent.com/121869045/230060270-59e86dec-08f3-4802-919c-17e5d8fe3271.png)

tempo has pprof enabled by default. if you can provide a memory profile of a healthy and unhealthy ingester i'd gladly dig into it.

Roger that, we will try to catch a live one and report back with some pprof output.

LasseHels commented 1 year ago

@joe-elliott A couple more things: Our usual runbook for this alert is to delete both the pod and the PVC, but we discovered that, if caught early, the issue can be fixed by deleting the pod without deleting the PVC: 228525170-7cf9a2d2-1ead-4b04-9a97-b573cf957ab7


use node metrics to look for network/disk saturation.

We use Goldpinger to monitor our nodes, and have been unable to find a link between node issues and PVCs filling up.


Our maximum trace size is rather large at 40MB. Even with this limit, our ingesters are still emitting a lot of TRACE_TOO_LARGE warnings (note the size of the Y-axis): Screenshot 2023-04-05 at 13 21 02

We have been wondering, given that our limit is significantly higher than the default, whether lowering this limit could have a positive impact.

There is some amount of politics involved with lowering the limit, so we want to have a plausible reason before lowering it.

LasseHels commented 1 year ago

@joe-elliott

tempo has pprof enabled by default. if you can provide a memory profile of a healthy and unhealthy ingester i'd gladly dig into it.

Here is some pprof debug data from a healthy and faulty ingester:

Note that the faulty ingester that the debug data was captured from was somewhat atypical, in that it took more than 24 hours to fill up: Screenshot 2023-04-05 at 14 20 58

I'm unsure if this makes the data more or less useful.

joe-elliott commented 1 year ago

Our usual runbook for this alert is to delete both the pod and the PVC, but we discovered that, if caught early, the issue can be fixed by deleting the pod without deleting the PVC:

Do you know if the pod moved nodes after it was deleted? I'm still suspicious of resource throttling at the node level. If this pod moved nodes when it was deleted it might suggest something along these lines. Does goldpinger give any indication of resource usage on the nodes? I would recommend using something like node exporter to dump node metrics.

Our maximum trace size is rather large at 40MB. Even with this limit, our ingesters are still emitting a lot of TRACE_TOO_LARGE warnings (note the size of the Y-axis):

We have a max size of 50MB and do not see this issue. Lowering the max size will improve ingester and compactor memory profiles. Humorously, we too face politics on lowering the limit :). This PR would improve things with regard to max trace size quite a bit, but I have not found time to go back and complete it: https://github.com/grafana/tempo/pull/1962

Here is some pprof debug data from a healthy and faulty ingester:

I will try to find some time to dig into this deeper, but it would be easier if you were able to pass the binary files retrieved from pprof directly. I took a quick look and didn't see anything out of the ordinary.

Interested to see if any of these metrics are showing anything suspicious: tempo_ingester_flush_retries_total, tempo_ingester_flush_failed_retries_total, tempo_ingester_flush_duration_seconds

are the flushes taking considerably longer during this period?

LasseHels commented 1 year ago

@joe-elliott

Do you know if the pod moved nodes after it was deleted?

I don't know about the concrete pod in the example earlier.

However, I did run a quick test; PVC ingester-data-ingester-zone-b-0 has been slowly filling up over the course of a week: Screenshot 2023-04-11 at 15 08 56

During all of this time, the ingester has been running on the same node.

I cordoned the ingester's node and deleted the ingester pod, forcing it onto a new node. Immediately after doing this, the PVC's space started to free up again: Screenshot 2023-04-11 at 15 11 22

After letting the PVC empty, I deleted the ingester once again, putting it back on its original node. Once placed there, the ingester exhibited no faulty behaviour, and continued operating with no issues.

Does goldpinger give any indication of resource usage on the nodes?

No, but as you also suggest, we use node_exporter for that.

CPU, memory and network are fairly consistent across Tempo nodes, but there is definitely some funny business going on with disk IO: Screenshot 2023-04-11 at 15 20 31

The above metrics are from a node exhibiting poor PVC behaviour. Note the Y-axis of Disk IO Saturation. The cyan colour in both diagrams represents sdd.

Compare that with two nodes on their best behaviour: Screenshot 2023-04-11 at 15 21 01

Screenshot 2023-04-11 at 15 21 53

My current working theory is that PVCs exhibit this behaviour when their underlying Azure premium SSD managed disk runs out of burst credits. Preliminary analysis, consisting of digging into metrics from Azure, has been unable to confirm this theory.

I will try to find some time to dig into this deeper, but it would be easier if you were able to pass the binary files retrieved from pprof directly. I took a quick look and didn't see anything out of the ordinary.

Roger that. We will try to get some binary files the next time we catch the issue happening.

Interested to see if any of these metrics are showing anything suspicious: tempo_ingester_flush_retries_total, tempo_ingester_flush_failed_retries_total, tempo_ingester_flush_duration_seconds

are the flushes taking considerably longer during this period?

Doesn't look like it, though the frequency of flushes seems to go down, as we have previously discussed: PVC filling up: Screenshot 2023-04-11 at 15 04 38

tempo_ingester_flush_retries_total: Screenshot 2023-04-11 at 15 05 11

tempo_ingester_flush_failed_retries_total: Screenshot 2023-04-11 at 15 05 41

tempo_ingester_flush_duration_seconds Screenshot 2023-04-11 at 15 06 07

Screenshot 2023-04-11 at 15 06 23

joe-elliott commented 1 year ago

Ok, I think things are coming together. Flushes aren't slowing down. The time it takes to cut a block is likely slowing down due to disk saturation.

My current working theory is that PVC's exhibit this behaviour when their underlying Azure premium SSD managed disk runs out of burst credits. Preliminary analysis consisting of digging into metrics from Azure has been unable to confirm this theory.

I do not know much about Azure SSDs. In our internal cluster we provision 50GB SSDs per ingester but really only use ~3-5GB. We used to use 15GB SSDs but found that we had to increase their size in order to get more guaranteed IOPS. If there is a similar relationship between size and IOPS in Azure, perhaps just bumping the disks up will solve the problem?

zhuoyuan-liu commented 1 year ago

Hi @joe-elliott

We found another interesting behavior about this issue: It failed to flush blocks to the storage account every other time.

From the log, we can see that it successfully cut out the blocks for all block IDs, but it does not invoke the handleFlush method. We know this because the relevant metrics like tempo_ingester_blocks_flushed_total do not increase and we do not see any error message. Unfortunately, we have not been able to identify why the handleFlush function is not executing.

level=info ts=2023-03-22T22:37:48.076276342Z caller=flush.go:251 msg="block completed" userid=single-tenant blockID=1ac27c15-6c95-4603-b311-55cf7a2e8282 duration=29.117754291s
level=info ts=2023-03-22T22:37:18.958522601Z caller=flush.go:244 msg="completing block" userid=single-tenant blockID=1ac27c15-6c95-4603-b311-55cf7a2e8282
level=info ts=2023-03-22T22:37:16.744301555Z caller=flush.go:168 msg="head block cut. enqueueing flush op" userid=single-tenant block=1ac27c15-6c95-4603-b311-55cf7a2e8282
level=info ts=2023-03-22T22:29:42.192528236Z caller=flush.go:300 msg="flushing block" userid=single-tenant block=4d1bb73b-3248-4b01-922a-848b17397a5d
level=info ts=2023-03-22T22:29:42.192183894Z caller=flush.go:251 msg="block completed" userid=single-tenant blockID=4d1bb73b-3248-4b01-922a-848b17397a5d duration=29.321009171s
level=info ts=2023-03-22T22:29:12.871183779Z caller=flush.go:244 msg="completing block" userid=single-tenant blockID=4d1bb73b-3248-4b01-922a-848b17397a5d
level=info ts=2023-03-22T22:29:06.979183755Z caller=flush.go:168 msg="head block cut. enqueueing flush op" userid=single-tenant block=4d1bb73b-3248-4b01-922a-848b17397a5d
level=info ts=2023-03-22T22:21:31.620497354Z caller=flush.go:251 msg="block completed" userid=single-tenant blockID=fff5d5a3-9ec3-4278-bd39-272837020f39 duration=28.924166793s
level=info ts=2023-03-22T22:21:02.696329719Z caller=flush.go:244 msg="completing block" userid=single-tenant blockID=fff5d5a3-9ec3-4278-bd39-272837020f39
level=info ts=2023-03-22T22:20:56.790034003Z caller=flush.go:168 msg="head block cut. enqueueing flush op" userid=single-tenant block=fff5d5a3-9ec3-4278-bd39-272837020f39
level=info ts=2023-03-22T22:13:28.591270913Z caller=flush.go:300 msg="flushing block" userid=single-tenant block=6a375746-34ea-4169-9bd4-af2adf8c6534
level=info ts=2023-03-22T22:13:28.59091572Z caller=flush.go:251 msg="block completed" userid=single-tenant blockID=6a375746-34ea-4169-9bd4-af2adf8c6534 duration=30.715002229s
level=info ts=2023-03-22T22:12:57.875914995Z caller=flush.go:244 msg="completing block" userid=single-tenant blockID=6a375746-34ea-4169-9bd4-af2adf8c6534
level=info ts=2023-03-22T22:12:56.920054167Z caller=flush.go:168 msg="head block cut. enqueueing flush op" userid=single-tenant block=6a375746-34ea-4169-9bd4-af2adf8c6534
level=info ts=2023-03-22T22:04:40.318653081Z caller=flush.go:251 msg="block completed" userid=single-tenant blockID=c9fca8ba-1641-4da9-a38b-1a9c9519fa51 duration=28.010166677s
level=info ts=2023-03-22T22:04:12.308485353Z caller=flush.go:244 msg="completing block" userid=single-tenant blockID=c9fca8ba-1641-4da9-a38b-1a9c9519fa51
level=info ts=2023-03-22T22:04:06.973886986Z caller=flush.go:168 msg="head block cut. enqueueing flush op" userid=single-tenant block=c9fca8ba-1641-4da9-a38b-1a9c9519fa51
level=info ts=2023-03-22T21:56:31.691333475Z caller=flush.go:300 msg="flushing block" userid=single-tenant block=62a1d924-075d-452b-b083-dbfb8a55fe7a
level=info ts=2023-03-22T21:56:31.690933869Z caller=flush.go:251 msg="block completed" userid=single-tenant blockID=62a1d924-075d-452b-b083-dbfb8a55fe7a duration=29.540496507s
level=info ts=2023-03-22T21:56:02.150438145Z caller=flush.go:244 msg="completing block" userid=single-tenant blockID=62a1d924-075d-452b-b083-dbfb8a55fe7a
level=info ts=2023-03-22T21:55:56.616548626Z caller=flush.go:168 msg="head block cut. enqueueing flush op" userid=single-tenant block=62a1d924-075d-452b-b083-dbfb8a55fe7a

LasseHels commented 1 year ago

@joe-elliott

I do not know much about Azure SSD's. In our internal cluster we provision 50GB SSDs per ingester but really only use ~3-5GB's. We used to use 15GB SSDs but found that we had to increase their size in order to get more guaranteed iops. If there is a similar relationship between size and iops in Azure perhaps just bumping the disks up will solve the problem?

With the information available at this time, I believe this is our best bet. It's always frustrating to make this kind of change without being 100% sure that it addresses the root cause of the issue, but data does seem to point towards disk IOPS being a (the) limiting factor here.

We are going to bump our disks up a performance tier or two and analyse the effect it has on the PVC issue.

joe-elliott commented 1 year ago

We found another interesting behavior about this issue: It failed to flush blocks to the storage account every other time.

I wonder if this is some strange artifact related to disk saturation? One block gets stalled out but the next goes through?

We are going to bump our disks up a performance tier or two and analyse the effect it has on the PVC issue.

good luck :)! let us know how it goes

LasseHels commented 1 year ago

@joe-elliott

We increased the size of our ingesters' disks on April 13th, and were initially hopeful that it had resolved the issue, as we did not observe the issue for several days. Unfortunately, a PVC started filling up again yesterday.

PVC filling up: Screenshot 2023-04-19 at 13 21 06

Node metrics: Screenshot 2023-04-19 at 13 21 20 Screenshot 2023-04-19 at 13 21 29

Flushes: Screenshot 2023-04-19 at 13 22 38

I deleted the pod (but not the PVC), and even though the pod was re-scheduled in the same node, the PVC immediately recovered: Screenshot 2023-04-19 at 14 59 51

The red arrow (showcasing my sublime design skills) indicates when the pod was deleted.

Since the PVC was able to recover on the same node that the issue originally appeared on, I guess we can conclude that the node itself is not the issue?

I grabbed some pprof data as you requested: faulty.zip

I'm not familiar with the pprof tool, so I hope the data is correct this time.

For reference, here is data from a healthy ingester: healthy.zip

A curious observation is that when a PVC does start filling up, the rate with which it fills varies a lot. This one has been slowly filling for two days, other ones take only a few hours to fill completely.

Another theory we were working with is that the PVC issues are somehow linked to surges of traffic. We introduced trace sampling a couple of weeks ago, which drastically reduced the amount of spans being created per second, as can be seen here: Screenshot 2023-04-19 at 15 06 20

The fact that the issue is still happening despite traffic being consistently low leads us to believe that traffic is not the root cause of the issue.

As always, any help is appreciated.

joe-elliott commented 1 year ago

So the issue still occurs but seems like it just takes longer to trigger? Is there any way to visualize Azure PVC burst credits to see if they are running out?

LasseHels commented 1 year ago

@joe-elliott

So the issue still occurs but seems like it just takes longer to trigger?

I realise my previous statement was unclear. What I meant to say was:

Is there any way to visualize Azure PVC burst credits to see if they are running out?

Yes, Azure offers several metrics related to disk usage. All the following tables cover the last seven days. As a reminder, the PVC in question suffered issues between 2023-04-18T09:15:00Z and 2023-04-19T12:00:00Z where it was manually remediated.

Data Disk IOPS Consumed Percentage: Screenshot 2023-04-21 at 11 11 56

If we were having issues with disk IOPS, then I'd expect this metric to be a smoking gun. As we can see, it never even reaches above 1% on average. I checked the maximum aggregation as well, where it briefly caps out at 6%.

Data Disk Bandwidth Consumed Percentage is also not an issue: Screenshot 2023-04-21 at 11 13 35

Data Disk Queue Depth also looks okay (note that the table shows the maximum aggregation and not average): Screenshot 2023-04-21 at 11 14 14

Average read operations per second: Screenshot 2023-04-21 at 11 15 00

Average write operations per second: Screenshot 2023-04-21 at 11 15 39

Going through disk metrics in Azure, I don't see any indication that disks are the bottleneck.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

LasseHels commented 1 year ago

I see that the bot is eager to close this issue, so I'll chime in with an update.

We never managed to get to the bottom of this issue, and we are still not sure exactly where it lies.

As previously mentioned, the rate with which a PVC fills up decreased dramatically after bumping our Azure disks from 20GB to 64GB. The PVCs now fill up so slowly that it is rather easy for us to catch it before it becomes critical.

In practice, we have two alerts on PVC usage: a warning alert at 50% utilisation, and a critical alert at a higher threshold.

We previously did not have the warning alert, which meant that our on-call personnel would occasionally get paged during the night with a very frustrating PVC alert. We introduced the 50% utilisation alert to catch erroring PVCs before they reach the critical alert threshold. Since it takes days for an erroring PVC to reach the critical threshold, the warning alert makes it easier for us to catch erroring PVCs and fix them.

Naturally, this is not a satisfying conclusion to the issue, but this approach has bought our engineering team some valuable time and allowed us to focus on other more important problems.

joe-elliott commented 1 year ago

Marking this "keepalive". Fwiw we still have never seen this issue in any of our clusters (including our Azure clusters). Unsure if this will help but here is our PVC:

$ kc describe pvc ingester-data-ingester-0
Name:          ingester-data-ingester-0
Namespace:     tempo-prod-09
StorageClass:  fast
Status:        Bound
Volume:        pvc-5ef35f1b-6e56-408b-b9cf-9fe7a6c449b9
Labels:        app=ingester
               name=ingester
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: disk.csi.azure.com
               volume.kubernetes.io/selected-node: aks-medium1-38800088-vmss0000v6
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      100Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       ingester-0
Events:        <none>
LasseHels commented 1 year ago

@joe-elliott Thanks for sharing that information!

Are you able to share what performance tier your Azure disks run with?

We use 64GiB with performance tier P6: Screenshot 2023-06-22 at 08 19 06

While the performance tier of the disk is tied to the disk's size by default, I believe it is possible to change the performance tier independently of size (e.g. to run a 100GiB disk with performance tier P15). I'd be curious to know if you are doing that.

joe-elliott commented 1 year ago

Here is one of our disks. P10?

image

LasseHels commented 1 year ago

@joe-elliott Eureka! After an unholy amount of headscratching, I may finally be wise as to what's going on here.

Here is the summary:

  1. Blocks are not being cleared from Tempo's PVC because
  2. Blocks are not being flushed because
  3. One of the ingester's flush queues stops processing flush operations because
  4. A flush operation in the queue hangs indefinitely because
  5. There is a racy deadlock in the azure-storage-blob-go SDK we use

In-depth analysis

In order for an ingester block to be cleared from disk, it has to have been flushed:

func (i *instance) ClearFlushedBlocks(completeBlockTimeout time.Duration) error {
    var err error

    i.blocksMtx.Lock()
    defer i.blocksMtx.Unlock()

    for idx, b := range i.completeBlocks {
        flushedTime := b.FlushedTime()
        if flushedTime.IsZero() {
            continue
        }

        if flushedTime.Add(completeBlockTimeout).Before(time.Now()) {
            i.completeBlocks = append(i.completeBlocks[:idx], i.completeBlocks[idx+1:]...)

            err = i.local.ClearBlock(b.BlockMeta().BlockID, i.instanceID)
            if err == nil {
                metricBlocksClearedTotal.Inc()
            }
            break
        }
    }

    return err
}

The first smoking gun was that blocks that pile up on disk have no flushed file:

/var/tempo/wal/blocks/single-tenant # ls 8386f1fb-32ba-4efd-b1e3-16f608921750
bloom-0       bloom-1       bloom-2       data.parquet  meta.json
/var/tempo/wal/blocks/single-tenant # ls cd61d593-9dc0-4468-b661-1c9b5bbd1350
bloom-0       bloom-1       bloom-2       data.parquet  meta.json
/var/tempo/wal/blocks/single-tenant # ls 04e2d03a-95d5-48f6-bf09-1694fda49f0d
bloom-0       bloom-1       bloom-2       data.parquet  meta.json
/var/tempo/wal/blocks/single-tenant # ls 4c578141-0e05-455f-8e74-fe7803df244a
bloom-0       bloom-1       bloom-2       data.parquet  meta.json
/var/tempo/wal/blocks/single-tenant # ls a40cac2e-9ca1-43bb-a98d-717fefb172c2
bloom-0       bloom-1       bloom-2       data.parquet  meta.json
/var/tempo/wal/blocks/single-tenant # ls 738a8f4f-22bd-4a52-8a7f-493d5bc2f0e7
bloom-0       bloom-1       bloom-2       data.parquet  meta.json

When a Tempo ingester boots, it starts i.cfg.ConcurrentFlushes flush loops:

    i.flushQueuesDone.Add(i.cfg.ConcurrentFlushes)
    for j := 0; j < i.cfg.ConcurrentFlushes; j++ {
        go i.flushLoop(j)
    }

i.cfg.ConcurrentFlushes defaults to 4, which is also the value we have it set to.

i.flushLoop() continuously picks up enqueued operations and processes them synchronously:

func (i *Ingester) flushLoop(j int) {
    defer func() {
        level.Debug(log.Logger).Log("msg", "Ingester.flushLoop() exited")
        i.flushQueuesDone.Done()
    }()

    for {
        o := i.flushQueues.Dequeue(j)
        if o == nil {
            return
        }
        op := o.(*flushOp)
        op.attempts++

        var retry bool
        var err error

        if op.kind == opKindComplete {
            retry, err = i.handleComplete(op)
        } else {
            retry, err = i.handleFlush(context.Background(), op.userID, op.blockID)
        }

        if err != nil {
            handleFailedOp(op, err)
        }

        if retry {
            i.requeue(op)
        } else {
            i.flushQueues.Clear(op)
        }
    }
}

A dead end I spent an unfortunate amount of time on was a theory that flush operations were not being enqueued at all, but this turned out to not be the case.

I eventually built a custom Tempo image with advanced telemetry (read: a bunch of additional log statements) in order to progress.

In doing so, I discovered that all operations were being enqueued, but that, in rare cases, a flush queue would stop processing operations altogether.

Compare the graph of a healthy queue: healthy

To one of an unhealthy queue (i.e. one that has stopped processing operations): unhealthy

j in the above two graphs is the index of the queue. In the case of this ingester, queue 2 had stopped processing operations, even though the ingester was still running.

A theory was starting to form: a flush queue gets stuck processing an operation and hangs indefinitely. The next step was to figure out where the operation gets stuck.

Using my custom log statements, I was able to pinpoint the issue to the handleFlush method:

func (i *Ingester) handleFlush(ctx context.Context, userID string, blockID uuid.UUID) (retry bool, err error) {
    sp, ctx := ot.StartSpanFromContext(ctx, "flush", ot.Tag{Key: "organization", Value: userID}, ot.Tag{Key: "blockID", Value: blockID.String()})
    defer sp.Finish()
    withSpan(level.Info(log.Logger), sp).Log("msg", "flushing block", "userid", userID, "block", blockID.String())

    instance, err := i.getOrCreateInstance(userID)
    if err != nil {
        return true, err
    }

    if instance == nil {
        return false, fmt.Errorf("instance id %s not found", userID)
    }

    if block := instance.GetBlockToBeFlushed(blockID); block != nil {
        ctx := user.InjectOrgID(ctx, userID)
        ctx, cancel := context.WithTimeout(ctx, i.cfg.FlushOpTimeout)
        defer cancel()

        start := time.Now()
        err = i.store.WriteBlock(ctx, block)
        metricFlushDuration.Observe(time.Since(start).Seconds())
        metricFlushSize.Observe(float64(block.BlockMeta().Size))
        if err != nil {
            ext.Error.Set(sp, true)
            sp.LogFields(otlog.Error(err))
            return true, err
        }

        metricBlocksFlushed.Inc()
    } else {
        return false, fmt.Errorf("error getting block to flush")
    }

    return false, nil
}

This method calls i.getOrCreateInstance, which acquires the i.instancesMtx lock:

func (i *Ingester) getOrCreateInstance(instanceID string) (*instance, error) {
    inst, ok := i.getInstanceByID(instanceID)
    if ok {
        return inst, nil
    }

    i.instancesMtx.Lock()
    defer i.instancesMtx.Unlock()
    inst, ok = i.instances[instanceID]
    if !ok {
        var err error
        inst, err = newInstance(instanceID, i.limiter, i.overrides, i.store, i.local, i.cfg.AutocompleteFilteringEnabled, i.cfg.DedicatedColumns)
        if err != nil {
            return nil, err
        }
        i.instances[instanceID] = inst
    }
    return inst, nil
}

Once again, I spent a considerable amount of time chasing a theory that a mutex deadlock was happening somewhere in the ingester. My suspicion that a mutex deadlock was at play was further exacerbated by this comment I found:

        // Now that we are adding a new block take the blocks mutex.
        // A warning about deadlocks!!  This area does a hard-acquire of both mutexes.
        // To avoid deadlocks this function and all others must acquire them in
        // the ** same_order ** or else!!! i.e. another function can't acquire blocksMtx
        // then headblockMtx. Even if the likelihood is low it is a statistical certainly
        // that eventually a deadlock will occur.

getOrCreateInstance doesn't acquire any of the locks mentioned in the comment, but I figured there could be an oversight somewhere. I was unable to find evidence to support this theory (and it was not for lack of trying 😄), and decided to move on.

Additional log statements revealed that the operation gets stuck when uploading the block to Azure (we use Azure as our storage backend): err = i.store.WriteBlock(ctx, block).

Further digging took me to Tempo's Azure writer implementation:

func (rw *readerWriter) writer(ctx context.Context, src io.Reader, name string) error {
    blobURL := rw.containerURL.NewBlockBlobURL(name)

    if _, err := blob.UploadStreamToBlockBlob(ctx, src, blobURL,
        blob.UploadStreamToBlockBlobOptions{
            BufferSize: rw.cfg.BufferSize,
            MaxBuffers: rw.cfg.MaxBuffers,
        },
    ); err != nil {
        return errors.Wrapf(err, "cannot upload blob, name: %s", name)
    }
    return nil
}

My theory at this point was that the passed ctx was not being respected, or was being overwritten downstream.

By default, the ingester passes a five minute context timeout to the write operation:

        ctx, cancel := context.WithTimeout(ctx, i.cfg.FlushOpTimeout)
        defer cancel()

        start := time.Now()
        err = i.store.WriteBlock(ctx, block)

Following the context's trail, I was not able to find evidence of it being overwritten in Tempo's code, so I started looking at the azure-storage-blob-go SDK we use.

The SDK's blob.UploadStreamToBlockBlob function sets a couple of defaults and then calls the internal copyFromReader() function, which is where the meat of the upload operation resides.

Inspecting the code, I did not find evidence that the context was being overwritten or not respected, and this theory ended up being a red herring as well.

What I did find was these two concurrency items:

When a Tempo ingester uploads a block to Azure storage, it chunks the block's data.parquet file into 3MiB chunks.

For each chunk, a goroutine is spawned through the SDK's TransferManager abstraction:

        // Some data was read, schedule the write.
        id := c.id.next()
        c.wg.Add(1)
        c.o.TransferManager.Run(
            func() {
                defer c.wg.Done()
                c.write(copierChunk{buffer: buffer, id: id, length: n})
            },
        )

As we can see, a weight of one is added to c.wg before each upload goroutine is started.

This is the write method:

// write uploads a chunk to blob storage.
func (c *copier) write(chunk copierChunk) {
    defer c.o.TransferManager.Put(chunk.buffer)

    if err := c.ctx.Err(); err != nil {
        return
    }

    _, err := c.to.StageBlock(c.ctx, chunk.id, bytes.NewReader(chunk.buffer[:chunk.length]), c.o.AccessConditions.LeaseAccessConditions, nil, c.o.ClientProvidedKeyOptions)
    if err != nil {
        c.errCh <- fmt.Errorf("write error: %w", err)
        return
    }
}

At the end of the write method, an err != nil check is performed. If an error is encountered, the error is sent to the c.errCh channel, and the method exits. Crucially, since c.errCh is a buffered channel with a capacity of one, this send will block the goroutine if the channel already contains an error.

The only place the c.errCh channel is read from is the getErr() method:

// getErr returns an error by priority. First, if a function set an error, it returns that error. Next, if the Context has an error
// it returns that error. Otherwise it is nil. getErr supports only returning an error once per copier.
func (c *copier) getErr() error {
    select {
    case err := <-c.errCh:
        return err
    default:
    }
    return c.ctx.Err()
}

This method is called as the first thing when processing each chunk:

    if err := c.getErr(); err != nil {
        return err
    }

If sendChunk() returns an error, the outer loop breaks immediately:

    // Send all our chunks until we get an error.
    var err error
    for {
        if err = cp.sendChunk(); err != nil {
            break
        }
    }
    // If the error is not EOF, then we have a problem.
    if err != nil && !errors.Is(err, io.EOF) {
        cp.wg.Wait()
        return nil, err
    }

This is the critical moment, as the outer method cannot return before cp.wg.Wait() exits. As we saw earlier, c.wg.Done() is called when write() returns, but write() can't return until it has written its error, which it cannot do if the error channel already contains an item.

Consider four chunks A, B, C and D. This is what happens:

  1. Chunk A's upload goroutine is started. c.wg is at one and the error channel is empty.
  2. Chunk B's upload goroutine is started. c.wg is at two and the error channel is empty.
  3. Chunk C's upload goroutine is started. c.wg is at three and the error channel is empty.
  4. Uploading chunk A fails. c.wg is at two and the error channel contains one error.
  5. Uploading chunk B fails. Chunk B's write method cannot return since the error channel already contains an error. c.wg is at two and the error channel contains one error.
  6. Uploading chunk C fails. Chunk C's write method cannot return since the error channel already contains an error. c.wg is at two and the error channel contains one error.
  7. Before processing chunk D, the errors channel is checked. An error is found, and the chunk loop breaks. The outer function is now blocked waiting for cp.wg.Wait().
  8. The errors channel is now empty, so chunk B's write method can send its error and return. c.wg is now at one, and the error channel contains one error.
  9. We've reached our deadlock. cp.wg.Wait() is still blocking since the wait group has a weight of one, but the final weight in the group cannot be resolved since chunk C cannot send its error to the errors channel. The errors channel will never be cleared, since the outer loop has exited.

In order for this to happen, multiple uploads have to fail, which is exactly what we see: errors
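
To make the failure mode concrete, here is a minimal, self-contained Go sketch of the pattern; this is my own reproduction of the behaviour described above, not the SDK's actual code. Running it prints the deadlock message after the timeout, because one worker stays blocked on the error channel and the WaitGroup never drains:

package main

import (
	"fmt"
	"sync"
	"time"
)

// Minimal reproduction of the deadlock: a buffered error channel of capacity
// one shared by several failing workers, and an outer routine that stops
// draining the channel after the first error while still waiting on the
// WaitGroup.
func main() {
	errCh := make(chan error, 1) // mirrors copier.errCh's capacity of one
	var wg sync.WaitGroup

	// Three "chunk uploads" that all fail (steps 1-6 above).
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// The send blocks once the channel already holds an error,
			// so wg.Done() never runs for the blocked workers.
			errCh <- fmt.Errorf("write error: chunk %d failed", id)
		}(i)
	}

	time.Sleep(100 * time.Millisecond)
	// The outer loop reads exactly one error and then stops draining the
	// channel (step 7 above).
	fmt.Println("first error:", <-errCh)

	done := make(chan struct{})
	go func() {
		wg.Wait() // steps 8-9: never returns, one worker is stuck sending
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("all workers finished")
	case <-time.After(2 * time.Second):
		fmt.Println("deadlock reproduced: a worker is still blocked on errCh")
	}
}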

The azure-storage-blob-go SDK we use has been superseded by the newer azure-sdk-for-go SDK.

The azure-sdk-for-go SDK contains a different implementation of chunkwriting.go without this race condition:

                if err != nil {
                    select {
                    case errCh <- err:
                        // error was set
                    default:
                        // some other error is already set
                    }
                    cancel()
                }

Using a select statement to explicitly check whether an error has already been sent to the channel prevents multiple errors from blocking indefinitely.
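
To illustrate the difference, here is a small sketch of that non-blocking send in isolation; trySend is my own hypothetical helper rather than the SDK's code. Every error after the first is simply dropped instead of blocking the sender, so the workers' deferred wg.Done() calls always run:

package main

import "fmt"

// trySend records the first error and silently drops the rest, so a worker
// never blocks on a full error channel.
func trySend(errCh chan error, err error) {
	select {
	case errCh <- err:
		// error recorded for the outer loop to act on
	default:
		// some other error is already pending; discard this one
	}
}

func main() {
	errCh := make(chan error, 1)
	for i := 0; i < 3; i++ {
		trySend(errCh, fmt.Errorf("write error: chunk %d failed", i))
	}
	fmt.Println("only the first error is kept:", <-errCh)
}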

As such, it is my tentative conclusion that upgrading to the new SDK will solve this problem. I'd be happy to take a look at this task, or at the very least evaluate how large of a task it is.

LasseHels commented 1 year ago

The new SDK may also help with these two issues:

joe-elliott commented 1 year ago

Wow, this is an impressive analysis. Thank you for digging this out. Since this is not currently impacting us we would likely have never been able to dig to the bottom of it internally. Great work!

I am 100% on board with upgrading the azure blob storage client. As you've found the one we are using is no longer maintained.

I'd be happy to take a look at this task, or at the very least evaluate how large of a task it is.

If you can do a first pass analysis on how big this is, we would appreciate it. Hopefully, it's a small lift.

I eventually built a custom Tempo image with advanced telemetry (read: a bunch of additional log statements)

This cracked me up. I too have dabbled in "advanced telemetry".

LasseHels commented 1 year ago

@joe-elliott

I am 100% on board with upgrading the azure blob storage client.

Roger that.

If you can do a first pass analysis on how big this is, we would appreciate it. Hopefully, it's a small lift.

I'd be happy to do that. My team is releasing a new internal product this Monday, so I might not get a chance to continue with this until next week, or the week after if things get busy.

I've taken the liberty of creating a new issue for tracking work on the SDK upgrade: https://github.com/grafana/tempo/issues/2835; I wouldn't want communication related to the upgrade to get buried in this issue.

LasseHels commented 11 months ago

For the record, we finally got around to enabling the V2 SDK on December 6th and have not seen the issue since: Screenshot 2023-12-12 at 12 43 09

joe-elliott commented 11 months ago

That's fantastic! Thanks for all your hard work on this @LasseHels. After we get some confidence in the v2 azure backend we will make it default.