thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Thanos-Compact halting with error 'err="compaction: group 0@17832940732465865817: overlapping sources detected' #6389

Open · Migueljfs opened this issue 1 year ago

Migueljfs commented 1 year ago

Thanos, Prometheus and Golang version used: Thanos: 0.31.0 Prometheus: 2.44.0

Object Storage Provider: Google (GCS)

What happened: The Thanos Compact pod halted shortly after starting with the following error:

level=error ts=2023-05-23T13:07:54.219139563Z caller=compact.go:487 msg="critical error detected; halting" err="compaction: group 0@17832940732465865817: overlapping sources detected for plan [01GZ0CAKEYN0BVTRW570T69QKM (min time: 1682553600388, max time: 1682560800000) 01H0FY8AR60GEVX3J3Q7Y7GSYG (min time: 1682553600388, max time: 1682596800000) 01GZ0K651WJG1M9W2DMC050NHF (min time: 1682560800388, max time: 1682568000000) 01GZ0R6020JTZ21GETCZCY5Y57 (min time: 1682568000388, max time: 1682575200000) 01GZ0Z1QA4HWJ4SX0RXJFZTM5X (min time: 1682575200388, max time: 1682582400000) 01GZ15XEJ2M6M0QH50XRQC3M65 (min time: 1682582400388, max time: 1682589600000) 01GZ1CS5T1VQ52EWBAWBSX7Z3W (min time: 1682589600388, max time: 1682596800000)]"

What you expected to happen: I would expect Thanos Compact to be able to deduplicate or merge the blocks in this case, but I'm not really sure.

Full logs of relevant components: I inspected the bucket, and the block IDs from the error message are the following:

|            ULID            |         FROM         |        UNTIL         |     RANGE      |   UNTIL-DOWN    |  #SERIES   |    #SAMPLES    |   #CHUNKS   | COMP-LEVEL | COMP-FAILED |                                                           LABELS                                                            | RESOLUTION |  SOURCE   |
|----------------------------|----------------------|----------------------|----------------|-----------------|------------|----------------|-------------|------------|-------------|-----------------------------------------------------------------------------------------------------------------------------|------------|-----------|
| 01H0FY8AR60GEVX3J3Q7Y7GSYG | 2023-04-27T00:00:00Z | 2023-04-27T12:00:00Z | 11h59m59.612s  | 28h0m0.388s     | 3,219      | 6,655,129      | 59,010      | 3          | false       | cluster=operations-staging,thanos_ruler_replica=thanos-ruler-evaluator-1                                                    | 0s         | compactor |
| 01GZ0CAKEYN0BVTRW570T69QKM | 2023-04-27T00:00:00Z | 2023-04-27T02:00:00Z | 1h59m59.612s   | 38h0m0.388s     | 3,117      | 1,107,200      | 9,811       | 2          | false       | cluster=operations-staging                                                                                                  | 0s         | compactor |
| 01GZ0K651WJG1M9W2DMC050NHF | 2023-04-27T02:00:00Z | 2023-04-27T04:00:00Z | 1h59m59.612s   | 38h0m0.388s     | 3,117      | 1,107,198      | 9,812       | 2          | false       | cluster=operations-staging                                                                                                  | 0s         | compactor |
| 01GZ0R6020JTZ21GETCZCY5Y57 | 2023-04-27T04:00:00Z | 2023-04-27T06:00:00Z | 1h59m59.612s   | 38h0m0.388s     | 3,123      | 1,107,288      | 9,817       | 1          | false       | cluster=operations-staging,thanos_ruler_replica=thanos-ruler-evaluator-0                                                    | 0s         | ruler     |
| 01GZ0Z1QA4HWJ4SX0RXJFZTM5X | 2023-04-27T06:00:00Z | 2023-04-27T08:00:00Z | 1h59m59.612s   | 38h0m0.388s     | 3,134      | 1,107,689      | 9,827       | 1          | false       | cluster=operations-staging,thanos_ruler_replica=thanos-ruler-evaluator-0                                                    | 0s         | ruler     |
| 01GZ15XEJ2M6M0QH50XRQC3M65 | 2023-04-27T08:00:00Z | 2023-04-27T10:00:00Z | 1h59m59.612s   | 38h0m0.388s     | 3,134      | 1,108,467      | 9,831       | 1          | false       | cluster=operations-staging,thanos_ruler_replica=thanos-ruler-evaluator-0                                                    | 0s         | ruler     |
| 01GZ1CS5T1VQ52EWBAWBSX7Z3W | 2023-04-27T10:00:00Z | 2023-04-27T12:00:00Z | 1h59m59.612s   | 38h0m0.388s     | 3,203      | 1,117,342      | 9,908       | 1          | false       | cluster=operations-staging,thanos_ruler_replica=thanos-ruler-evaluator-0                                                    | 0s         | ruler     |
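
For reference, a block listing like the one above can be produced with the bucket inspect tool. A minimal sketch, assuming the GCS objstore configuration lives in a local bucket.yaml (the file name is just a placeholder):

    # Print every block in the bucket with its time range, labels, resolution and source.
    thanos tools bucket inspect \
      --objstore.config-file=bucket.yaml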
mhoffm-aiven commented 1 year ago

How is the compactor configured? It looks like a historical compactor had a different configuration than the new one, because the first block still carries the replica label.

Migueljfs commented 1 year ago

I have 5 sharded compactors running with this config:

        - compact
        - --wait
        - --log.level=info
        - --log.format=logfmt
        - --objstore.config=$(OBJSTORE_CONFIG)
        - --data-dir=/var/thanos/compact
        - --debug.accept-malformed-index
        - --retention.resolution-raw=2y
        - --retention.resolution-5m=2y
        - --retention.resolution-1h=2y
        - --delete-delay=48h
        - --compact.concurrency=1
        - --downsample.concurrency=1
        - --deduplication.replica-label=prometheus_replica
        - --deduplication.replica-label=receive_replica
        - --deduplication.replica-label=thanos_ruler_replica
        - --compact.enable-vertical-compaction
        - |-
          --selector.relabel-config=
            - action: hashmod
              source_labels: ["cluster"]
              target_label: shard
              modulus: 5
            - action: keep
              source_labels: ["shard"]
              regex: 0

With the `regex` value going from 0 to 4 across the five compactors.
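
For example (based on the description above, only the keep rule's regex differs between shards), the selector passed to the second compactor would be:

        - |-
          --selector.relabel-config=
            - action: hashmod
              source_labels: ["cluster"]
              target_label: shard
              modulus: 5
            - action: keep
              source_labels: ["shard"]
              regex: 1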

By the way, this is the exact same config I deploy on my other clusters (different cluster environments with different buckets), and only this one is giving these errors.

mhoffm-aiven commented 1 year ago

My guess is that it previously ran without the "thanos_ruler_replica" dedup label, since block 01H0FY8AR60GEVX3J3Q7Y7GSYG still has it even though it's already compacted and appears in the compaction plan. You could probably mark it as no-compact?
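
If marking the block as no-compact, the bucket tooling can upload the marker. A sketch, reusing the hypothetical bucket.yaml objstore config; the details text is only an example:

    # Writes a no-compact-mark.json next to the block so the planner skips it.
    thanos tools bucket mark \
      --objstore.config-file=bucket.yaml \
      --marker=no-compact-mark.json \
      --id=01H0FY8AR60GEVX3J3Q7Y7GSYG \
      --details="kept thanos_ruler_replica from an older compactor config"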

Migueljfs commented 1 year ago

It's possible; it's been a while, so to be honest I don't remember.

Either way, what I have done since then is remove the chunks directly from my bucket (this is a staging environment, so I don't care much about the data itself; I just want to understand how to solve this in case it comes up in prod).

However, thanos-compact eventually halts again on a new set of chunks. Then I delete them, thanos-compact runs again until it halts, and so on.

It's been like this for the past two weeks, and I have deleted a bunch of chunks. I thought there were some corrupted chunks or something like that, but I'm starting to think it will be like this forever, and I can't understand why.
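
One way to see why a given plan trips this check is to compare the compaction.sources lists in each block's meta.json: the error fires when two blocks in the same compaction group were built from the same source ULIDs. A sketch using gsutil with a placeholder bucket name:

    # Show the external labels and the raw source blocks each compacted block was built from.
    gsutil cat gs://<your-bucket>/01H0FY8AR60GEVX3J3Q7Y7GSYG/meta.json | jq '.thanos.labels, .compaction.sources'
    gsutil cat gs://<your-bucket>/01GZ0CAKEYN0BVTRW570T69QKM/meta.json | jq '.thanos.labels, .compaction.sources'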

jaspreet-yb commented 10 months ago

Facing the same issue, due to which compaction is getting halted:

ts=2023-11-28T08:24:05.350841732Z caller=compact.go:491 level=error msg="critical error detected; halting" err="compaction: group 300000@5350783008949816695: failed to run pre compaction callback for plan: [01HG7EDJNBXVZ2J0PQS2HT87Q8 (min time: 1699488000002, max time: 1700697600000) 01HGAG0TKWJ09CS7SBAXGXTBME (min time: 1700402400000, max time: 1700697600000)]: overlapping sources detected for plan [01HG7EDJNBXVZ2J0PQS2HT87Q8 (min time: 1699488000002, max time: 1700697600000) 01HGAG0TKWJ09CS7SBAXGXTBME (min time: 1700402400000, max time: 1700697600000)]"

Our current thanos compact config

    Args:
      compact
      --wait
      --log.level=info
      --log.format=logfmt
      --objstore.config=$(OBJSTORE_CONFIG)
      --data-dir=/var/thanos/compact
      --debug.accept-malformed-index
      --retention.resolution-raw=7d
      --retention.resolution-5m=30d
      --retention.resolution-1h=545d
      --delete-delay=48h
      --deduplication.replica-label=prometheus_replica
      --compact.enable-vertical-compaction
      --deduplication.func=penalty

We have 6 shards and 2 replicas for Prometheus.

Kot-o-pes commented 3 months ago

Hi there, I'm facing this issue too:

thanos-compactor[2635699]: {"caller":"compact.go:527","err":"compaction: group 300000@7488097868448971783: failed to run pre compaction callback for plan: [01HZN4SYGV9HE0ZJC6ZMY31JVK (min time: 1710374400000, max time: 1711411200000) 01J06BGV9K11S7TV8JB35HAQRK (min time: 1711065600000, max time: 1711584000000)]: overlapping sources detected for plan [01HZN4SYGV9HE0ZJC6ZMY31JVK (min time: 1710374400000, max time: 1711411200000) 01J06BGV9K11S7TV8JB35HAQRK (min time: 1711065600000, max time: 1711584000000)]","level":"error","msg":"critical error detected; halting","ts":"2024-06-13T07:41:53.210408962Z"}

Prometheus has 3 replicas.

| ULID | FROM | UNTIL | RANGE | UNTIL-DOWN | #SERIES | #SAMPLES | #CHUNKS | COMP-LEVEL | COMP-FAILED | LABELS | RESOLUTION | SOURCE |
|------|------|-------|-------|------------|---------|----------|---------|------------|-------------|--------|------------|--------|
| 01HZN4SYGV9HE0ZJC6ZMY31JVK | 2024-03-14T03:00:00+03:00 | 2024-03-26T03:00:00+03:00 | 288h0m0s | -48h0m0s | 66,629,361 | 54,953,985,276 | 511,345,472 | 5 | false | cluster=k8s.prod,environment=prod,manage_by=flux,prometheus=monitoring/prom-operator-prometheus | 5m0s | compactor |
| 01J06BGV9K11S7TV8JB35HAQRK | 2024-03-22T03:00:00+03:00 | 2024-03-28T03:00:00+03:00 | 144h0m0s | 96h0m0s | 40,684,997 | 27,949,523,210 | 232,661,558 | 5 | false | cluster=k8s.prod,environment=prod,manage_by=flux,prometheus=monitoring/prom-operator-prometheus | | |

I tried to add a no-compact mark, and also found this issue about no-compact marks being ignored: https://github.com/thanos-io/thanos/issues/5603
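
If a no-compact mark seems to be ignored, it may be worth confirming that the marker object actually exists alongside the block before the planner runs. A sketch with a placeholder bucket name, using the ULID from the error above:

    # The marker is stored as <ULID>/no-compact-mark.json in the bucket.
    gsutil ls gs://<your-bucket>/01HZN4SYGV9HE0ZJC6ZMY31JVK/no-compact-mark.json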