Migueljfs opened 1 year ago
How is the compactor configured? It looks like a historical compactor had a different configuration than the new one, because the first block still has the replica label.
I have 5 sharded compactors running with this config:

```yaml
- compact
- --wait
- --log.level=info
- --log.format=logfmt
- --objstore.config=$(OBJSTORE_CONFIG)
- --data-dir=/var/thanos/compact
- --debug.accept-malformed-index
- --retention.resolution-raw=2y
- --retention.resolution-5m=2y
- --retention.resolution-1h=2y
- --delete-delay=48h
- --compact.concurrency=1
- --downsample.concurrency=1
- --deduplication.replica-label=prometheus_replica
- --deduplication.replica-label=receive_replica
- --deduplication.replica-label=thanos_ruler_replica
- --compact.enable-vertical-compaction
- |-
  --selector.relabel-config=
  - action: hashmod
    source_labels: ["cluster"]
    target_label: shard
    modulus: 5
  - action: keep
    source_labels: ["shard"]
    regex: 0
```
with the `regex` value going from 0 to 4 across the five shards.
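For anyone reading along, the hashmod sharding above can be sketched in a few lines. This is a standalone approximation (not Thanos code), assuming current Prometheus `hashmod` semantics: MD5 of the joined source label values, with the last 8 bytes of the digest read as a big-endian uint64 and reduced modulo the modulus.

```python
import hashlib


def hashmod_shard(label_value: str, modulus: int) -> int:
    """Approximate Prometheus's `hashmod` relabel action for a single source
    label. (With multiple source_labels, Prometheus joins the values with ';'
    before hashing.) Current Prometheus uses the last 8 bytes of the MD5
    digest as a big-endian uint64, modulo the configured modulus."""
    digest = hashlib.md5(label_value.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % modulus


# Each compactor replica then keeps only blocks whose external `cluster`
# label hashes to its own shard number (the `keep` action with regex 0..4).
shard = hashmod_shard("my-cluster", 5)
```

The point of the `keep` step is that every compactor sees the whole bucket but only compacts the groups whose computed `shard` label matches its regex, so the five replicas never touch the same compaction group.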
By the way, this is the exact same config I deploy on my other clusters (different cluster environments with different buckets), and only this one is giving these errors.
My guess is that it ran before without the "thanos_ruler_replica" dedup label? Since the 01H0FY8AR60GEVX3J3Q7Y7GSYG block still has it, even though it's already compacted and appears in the compaction plan. You could probably mark it as no-compact?
It's possible; it's been a while, so I don't remember, to be honest.
Either way, what I did since then was remove the chunks directly from my bucket (this is a staging env, so I don't care that much about the data itself; I just wanted to understand how to solve this in case it comes up in prod).
However, thanos-compact eventually halts again on a new set of chunks. Then I delete them, thanos-compact runs until it halts again, and so on.
It's been like this for the past 2 weeks, and I have deleted a bunch of chunks. I thought there were a bunch of corrupted chunks or something like that, but I'm starting to think it will be like this forever, and I can't understand why.
Facing the same issue; compaction is getting halted:
```
ts=2023-11-28T08:24:05.350841732Z caller=compact.go:491 level=error msg="critical error detected; halting" err="compaction: group 300000@5350783008949816695: failed to run pre compaction callback for plan: [01HG7EDJNBXVZ2J0PQS2HT87Q8 (min time: 1699488000002, max time: 1700697600000) 01HGAG0TKWJ09CS7SBAXGXTBME (min time: 1700402400000, max time: 1700697600000)]: overlapping sources detected for plan [01HG7EDJNBXVZ2J0PQS2HT87Q8 (min time: 1699488000002, max time: 1700697600000) 01HGAG0TKWJ09CS7SBAXGXTBME (min time: 1700402400000, max time: 1700697600000)]"
```
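The min/max times in that error show why the compactor refuses the plan: the two blocks cover intersecting time ranges. A quick standalone sketch (not Thanos code) to check the overlap from the millisecond timestamps in the message:

```python
def ranges_overlap(a_min: int, a_max: int, b_min: int, b_max: int) -> bool:
    """True when two [min, max) millisecond time ranges intersect."""
    return a_min < b_max and b_min < a_max


# min/max times copied verbatim from the error message above (Unix ms).
a = (1699488000002, 1700697600000)  # 01HG7EDJNBXVZ2J0PQS2HT87Q8
b = (1700402400000, 1700697600000)  # 01HGAG0TKWJ09CS7SBAXGXTBME
print(ranges_overlap(*a, *b))  # → True: the second block's range sits inside the first's
```

The halt itself is Thanos detecting that these overlapping blocks were built from overlapping source blocks, which plain vertical compaction won't merge automatically.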
Our current Thanos Compact args:

```
compact
--wait
--log.level=info
--log.format=logfmt
--objstore.config=$(OBJSTORE_CONFIG)
--data-dir=/var/thanos/compact
--debug.accept-malformed-index
--retention.resolution-raw=7d
--retention.resolution-5m=30d
--retention.resolution-1h=545d
--delete-delay=48h
--deduplication.replica-label=prometheus_replica
--compact.enable-vertical-compaction
--deduplication.func=penalty
```
We have 6 shards and 2 replicas for Prometheus.
Hi there, I faced this issue too:
```
thanos-compactor[2635699]: {"caller":"compact.go:527","err":"compaction: group 300000@7488097868448971783: failed to run pre compaction callback for plan: [01HZN4SYGV9HE0ZJC6ZMY31JVK (min time: 1710374400000, max time: 1711411200000) 01J06BGV9K11S7TV8JB35HAQRK (min time: 1711065600000, max time: 1711584000000)]: overlapping sources detected for plan [01HZN4SYGV9HE0ZJC6ZMY31JVK (min time: 1710374400000, max time: 1711411200000) 01J06BGV9K11S7TV8JB35HAQRK (min time: 1711065600000, max time: 1711584000000)]","level":"error","msg":"critical error detected; halting","ts":"2024-06-13T07:41:53.210408962Z"}
```

Prometheus has 3 replicas. The two blocks from the error look like this in the bucket:
```
| 01HZN4SYGV9HE0ZJC6ZMY31JVK | 2024-03-14T03:00:00+03:00 | 2024-03-26T03:00:00+03:00 | 288h0m0s | -48h0m0s | 66,629,361 | 54,953,985,276 | 511,345,472 | 5 | false | cluster=k8s.prod,environment=prod,manage_by=flux,prometheus=monitoring/prom-operator-prometheus | 5m0s | compactor |
| 01J06BGV9K11S7TV8JB35HAQRK | 2024-03-22T03:00:00+03:00 | 2024-03-28T03:00:00+03:00 | 144h0m0s | 96h0m0s | 40,684,997 | 27,949,523,210 | 232,661,558 | 5 | false | cluster=k8s.prod,environment=prod,manage_by=flux,prometheus=monitoring/prom-operator-prometheus
```
I tried adding a no-compact mark; I also found this issue about no-compact marks being ignored: https://github.com/thanos-io/thanos/issues/5603
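For context on what a no-compact mark actually is: it is a small JSON object uploaded next to the block, normally written with `thanos tools bucket mark --marker=no-compact-mark.json --id=<ULID>`. The sketch below shows roughly what such a marker contains; the field names and values here are my assumptions about the format, so verify them against the marker your own Thanos version produces before touching anything in the bucket by hand.

```python
import json
import time

# Assumed shape of <bucket>/<ULID>/no-compact-mark.json -- field names are an
# assumption, not copied from Thanos source; check against a marker written by
# `thanos tools bucket mark` for your version.
marker = {
    "id": "01HZN4SYGV9HE0ZJC6ZMY31JVK",  # block ULID from the error above
    "version": 1,
    "details": "manually excluded: overlapping sources",
    "no_compact_time": int(time.time()),
    "reason": "manual",
}
print(json.dumps(marker, indent=2))
```

If the linked issue about marks being ignored applies to your version, the marker may be present but still not respected by the planner, which would match the behavior described here.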
Thanos, Prometheus and Golang version used: Thanos 0.31.0, Prometheus 2.44.0
Object Storage Provider: Google Cloud Storage (GCS)
What happened: Thanos-compact pod halted shortly after starting with error:
What you expected to happen: I believe Thanos Compact should be able to deduplicate or merge the blocks in that case? Not really sure.
Full logs to relevant components: I inspected the bucket, and the block IDs in the error message are the following: