thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Thanos update causing compactor to err on sync before retention #7616

Open mazad01 opened 3 months ago

mazad01 commented 3 months ago

I updated from Thanos 0.28.1 to Thanos 0.36.0. Since then, my compactor has been logging the following error on many instances (but not all):

{"caller":"compact.go:553","err":"syncing metas: filter metas: filter blocks marked for no downsample: get file: 01HTBG8EWAJGTQP6EHTPB37705/no-downsample-mark.json: Get \"https://storage.googleapis.com/instance-1/01HTBG8EWAJGTQP6EHTPB37705%2Fno-downsample-mark.json\": dial tcp <ip>: connect: cannot assign requested address","level":"error","msg":"retriable error","ts":"2024-08-09T00:04:41.953444998Z"}

I tried increasing the CPU allocation, since CPU usage rose significantly with the version update, but to no avail. Help appreciated. Thanks!

yeya24 commented 3 months ago

This doesn't seem like a bug. Is it a 404 from GCS? Do you have any GCS metrics to check? I think it just means the no downsample marker was not found, which is expected.

mazad01 commented 2 months ago

Verified with our GCS team, and they don't see any issues on their side. It's also not just downsample markers. There are logs such as:

{"caller":"compact.go:553","err":"sync before retention: filter metas: filter blocks marked for deletion: get file: 01HZ6J56Y6FWE1J6PX22QSPG5K/deletion-mark.json: Get \"https://storage.googleapis.com/instance/01HZ6J56Y6FWE1J6PX22QSPG5K%2Fdeletion-mark.json\": dial tcp 1<ip>: connect: cannot assign requested address","level":"error","msg":"retriable error","ts":"2024-08-12T06:30:29.010988725Z"}
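For readers following along, here is a minimal sketch of pulling the wrapped error chain out of a log line like the one above. The helper names `split_error_chain` and `failed_before_http_response` are hypothetical and not part of Thanos. The sketch assumes the `err` field is a chain of messages joined by `: `, as it appears in the logs in this thread. A `dial tcp` segment means the failure happened while opening the TCP connection, so no HTTP status (such as a 404) was ever returned.

```python
import json

# Log line copied verbatim from the comment above.
LOG_LINE = r'{"caller":"compact.go:553","err":"sync before retention: filter metas: filter blocks marked for deletion: get file: 01HZ6J56Y6FWE1J6PX22QSPG5K/deletion-mark.json: Get \"https://storage.googleapis.com/instance/01HZ6J56Y6FWE1J6PX22QSPG5K%2Fdeletion-mark.json\": dial tcp 1<ip>: connect: cannot assign requested address","level":"error","msg":"retriable error","ts":"2024-08-12T06:30:29.010988725Z"}'


def split_error_chain(log_line: str) -> list[str]:
    """Parse one JSON log line and split its wrapped error into segments.

    Assumption: segments are joined with ': ' (outermost context first).
    """
    record = json.loads(log_line)
    return record["err"].split(": ")


def failed_before_http_response(chain: list[str]) -> bool:
    """Return True if the chain contains a 'dial tcp' segment.

    A dial error means the TCP connection was never established,
    so the request did not reach the point of getting an HTTP status.
    """
    return any(part.startswith("dial tcp") for part in chain)


chain = split_error_chain(LOG_LINE)
print("failing stage:", chain[0])
print("root cause:   ", chain[-1])
print("connection-level failure:", failed_before_http_response(chain))
```

Running the same helper over both log lines in this thread shows different outer stages ("syncing metas" and "sync before retention") ending in the same root cause.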

To add more detail: if a compactor pod is bounced, it runs for maybe 2-3 hours without the errors. That doesn't sound like GCS network saturation.

Is there anything else I can try (debug logs don't really help much)?

Edit: it also happens on pod boot, so ignore my comment about it going 2-3 hours without errors.
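One thing that could be checked, offered as an unconfirmed guess rather than a diagnosis: on Linux, `connect: cannot assign requested address` is the EADDRNOTAVAIL error, which commonly shows up when no local ephemeral port is free for a new outbound connection. Nothing in this thread confirms that this is the cause here. The sketch below counts sockets per TCP state from the text of `/proc/net/tcp`; the function name `count_tcp_states` is hypothetical, and it assumes the standard layout of that file, where the fourth column is the state as a hex code.

```python
from collections import Counter

# Linux TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "02": "SYN_SENT",
    "03": "SYN_RECV",
    "04": "FIN_WAIT1",
    "05": "FIN_WAIT2",
    "06": "TIME_WAIT",
    "07": "CLOSE",
    "08": "CLOSE_WAIT",
    "09": "LAST_ACK",
    "0A": "LISTEN",
    "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text: str) -> Counter:
    """Count sockets per TCP state from the contents of /proc/net/tcp."""
    counts: Counter = Counter()
    lines = proc_net_tcp_text.strip().splitlines()
    for line in lines[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or truncated rows
        state_code = fields[3].upper()
        counts[TCP_STATES.get(state_code, state_code)] += 1
    return counts
```

Inside the compactor pod this could be called as `count_tcp_states(open("/proc/net/tcp").read())` (and likewise for `/proc/net/tcp6`). A very large number of `TIME_WAIT` or `ESTABLISHED` sockets toward the storage endpoint would be consistent with port exhaustion; a small number would point elsewhere.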

mazad01 commented 2 months ago

Some other info.

CPU usage after the update:

image

The thanos_compact_retries_total metric was a non-factor prior to the upgrade. Retries started right after the update:

image