mazad01 opened this issue 3 months ago
This doesn't seem like a bug. Is it a 404 from GCS? Do you have any GCS metrics to check? I think it just means the no downsample marker was not found, which is expected.
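For example, something along these lines would show whether object-store calls from the compactor are actually failing. This is only a rough sketch using the Prometheus Go client; the Prometheus address, the `job="thanos-compact"` label, and the `thanos_objstore_bucket_operation*` metric names are assumptions on my side, so adjust them to your setup:

```go
// Rough sketch: query Prometheus for the compactor's object-store failure
// ratio, split by operation, to see whether GCS calls are actually failing.
// The Prometheus address, job label, and metric names below are assumptions.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"}) // placeholder address
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	// Failure ratio per bucket operation over the last hour.
	query := `sum by (operation) (rate(thanos_objstore_bucket_operation_failures_total{job="thanos-compact"}[1h]))
	          /
	          sum by (operation) (rate(thanos_objstore_bucket_operations_total{job="thanos-compact"}[1h]))`

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```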
Verified with our GCS team, and they don't see any issues on their side. It's also not just downsample markers; there are logs such as:
{"caller":"compact.go:553","err":"sync before retention: filter metas: filter blocks marked for deletion: get file: 01HZ6J56Y6FWE1J6PX22QSPG5K/deletion-mark.json: Get \"https://storage.googleapis.com/instance/01HZ6J56Y6FWE1J6PX22QSPG5K%2Fdeletion-mark.json\": dial tcp 1<ip>: connect: cannot assign requested address","level":"error","msg":"retriable error","ts":"2024-08-12T06:30:29.010988725Z"}
To add more detail: if a compactor pod is bounced, it may go 2-3 hours without the errors. That doesn't sound like GCS network saturation.
Is there anything else I can try (debug logs don't really help much)?
Edit: it also happens on pod boot, so ignore my comment about it going 2-3 hours without errors.
Some other info.
CPU usage after the update: [graph]
The thanos_compact_retries_total metric was not a factor prior to the upgrade; the retries started right after the update: [graph]
We updated from Thanos 0.28.1 to Thanos 0.36.0, and the compactor has been giving the error above across many instances (but not all). Tried increasing CPU, since I noticed usage went up significantly with the version update, but to no avail. Help appreciated. Thanks!