thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io

thanos-sidecar uploads OOO block before Prometheus compacts it, upsetting thanos-compact #7551

Open kennylevinsen opened 2 months ago

kennylevinsen commented 2 months ago

Thanos, Prometheus and Golang version used:

thanos, version 0.35.1 (branch: HEAD, revision: 086a698b2195adb6f3463ebbd032e780f39d2050)
  build user:       root@be0f036fd8fa
  build date:       20240528-13:54:20
  go version:       go1.21.10
  platform:         linux/amd64
  tags:             netgo

prometheus, version 2.52.0 (branch: HEAD, revision: 879d80922a227c37df502e7315fad8ceb10a986d)
  build user:       root@1b4f4c206e41
  build date:       20240508-21:56:43
  go version:       go1.22.3
  platform:         linux/amd64
  tags:             netgo,builtinassets,stringlabels

Object Storage Provider:

S3 (minio)
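
(For reference, the redacted --objstore.config-file used by the sidecar and compactor follows the standard Thanos S3 client config pointed at MinIO; the bucket, endpoint and credentials below are placeholders, not the real values:)

type: S3
config:
  bucket: "thanos"
  endpoint: "minio.example.internal:9000"
  access_key: "<redacted>"
  secret_key: "<redacted>"
  insecure: true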

What happened:

Prometheus is configured with a 1h out_of_order_time_window to allow some slack from a very large number of Grafana agents.

Every once in a while, after writing a new normal block, Prometheus will write one or more out-of-order blocks. These blocks sometimes cover the current metric window (e.g., at 01:00 an out-of-order block is written that covers 00:00 to 02:00). The sidecar uploads this block immediately.

When Prometheus finishes up the current window (e.g., at 03:00 a normal block is written that covers 00:00 to 02:00), the sidecar uploads the new block as well. At this point, Thanos Compact discovers an overlap and gets upset.

What you expected to happen:

Either:

  1. Wait for Prometheus to compact blocks before uploading, which merges the out-of-order blocks into a regular block.
  2. Have the compactor merge blocks carrying the out-of-order hint with the regular block, complaining only if two blocks overlap without the hint.

I don't think Prometheus is doing anything wrong here in how it creates the blocks, but please do correct me if I'm wrong.

How to reproduce it (as minimally and precisely as possible):

I have not tried to reproduce externally, but I believe this would be the way:

  1. Run Prometheus with a 1h out-of-order time window
  2. Have agents perform remote write with a possibly large clock skew, causing Prometheus to write an OOO block for the current window.
  3. Run Thanos Sidecar to upload blocks, and Thanos Compact against the same bucket

Based on the logs, it seems that the sidecar doesn't always react to the OOO blocks.

Full logs to relevant components:

prometheus[962]: ts=2024-07-16T01:01:08.352Z caller=compact.go:576 level=info component=tsdb msg="write block" mint=1721080800131 maxt=1721088000000 ulid=01J2WGVCKCEV3YEH5R6ZQ8AN00 duration=1m7.347661301s ooo=false
prometheus[962]: ts=2024-07-16T01:01:08.392Z caller=db.go:1680 level=info component=tsdb msg="Deleting obsolete block" block=01J2SYJ89FMK5Y05KQFTSJJ4SB

# Regular block (22:00 -> 00:00) being uploaded
thanos[977]: ts=2024-07-16T01:01:12.463204977Z caller=shipper.go:372 level=info msg="upload new block" id=01J2WGVCKCEV3YEH5R6ZQ8AN00
prometheus[962]: ts=2024-07-16T01:01:20.991Z caller=head.go:1345 level=info component=tsdb msg="Head GC completed" caller=truncateMemory duration=12.598972304s
prometheus[962]: ts=2024-07-16T01:01:21.393Z caller=checkpoint.go:101 level=info component=tsdb msg="Creating checkpoint" from_segment=6026 to_segment=6059 mint=1721088000000
prometheus[962]: ts=2024-07-16T01:01:56.076Z caller=head.go:1307 level=info component=tsdb msg="WAL checkpoint complete" first=6026 last=6059 duration=34.683165971s
prometheus[962]: ts=2024-07-16T01:02:15.274Z caller=compact.go:576 level=info component=tsdb msg="write block" mint=1721080800000 maxt=1721088000000 ulid=01J2WGZEQ1B7T446RBNSR0YWN3 duration=1.033067754s ooo=true
prometheus[962]: ts=2024-07-16T01:02:15.491Z caller=compact.go:576 level=info component=tsdb msg="write block" mint=1721088000000 maxt=1721095200000 ulid=01J2WGZFQBHM1MDE2F60ZZQGJY duration=216.915065ms ooo=true
prometheus[962]: ts=2024-07-16T01:02:15.492Z caller=db.go:1324 level=info component=tsdb msg="out-of-order compaction completed" duration=1.250113817s ulids="[01J2WGZEQ1B7T446RBNSR0YWN3 01J2WGZFQBHM1MDE2F60ZZQGJY]"
prometheus[962]: ts=2024-07-16T01:02:15.493Z caller=db.go:1517 level=warn component=tsdb msg="Overlapping blocks found during reloadBlocks" detail="[mint: 1721080800131, maxt: 1721088000000, range: 1h59m59s, blocks: 2]: <ulid: 01J2WGZEQ1B7T446RBNSR0YWN3, mint: 1721080800000, maxt: 1721088000000, range: 2h0m0s>, <ulid: 01J2WGVCKCEV3YEH5R6ZQ8AN00, mint: 1721080800131, maxt: 1721088000000, range: 1h59m59s>"
prometheus[962]: ts=2024-07-16T01:02:15.567Z caller=db.go:1680 level=info component=tsdb msg="Deleting obsolete block" block=01J2T5AD70YGACS2K6P4M74XMB
prometheus[962]: ts=2024-07-16T01:02:18.158Z caller=head.go:1345 level=info component=tsdb msg="Head GC completed" caller=truncateOOO duration=2.59074302s
prometheus[962]: ts=2024-07-16T01:02:18.170Z caller=compact.go:762 level=info component=tsdb msg="Found overlapping blocks during compaction" ulid=01J2WGZJHF2WVHJG6123DFY0HH
prometheus[962]: ts=2024-07-16T01:03:00.330Z caller=compact.go:514 level=info component=tsdb msg="compact blocks" count=2 mint=1721080800000 maxt=1721088000000 ulid=01J2WGZJHF2WVHJG6123DFY0HH sources="[01J2WGZEQ1B7T446RBNSR0YWN3 01J2WGVCKCEV3YEH5R6ZQ8AN00]" duration=42.171127583s
prometheus[962]: ts=2024-07-16T01:03:00.338Z caller=db.go:1680 level=info component=tsdb msg="Deleting obsolete block" block=01J2WGZEQ1B7T446RBNSR0YWN3
prometheus[962]: ts=2024-07-16T01:03:00.434Z caller=db.go:1680 level=info component=tsdb msg="Deleting obsolete block" block=01J2WGVCKCEV3YEH5R6ZQ8AN00

# Out of order block (01J2WGZFQBHM1MDE2F60ZZQGJY) for the current span (00:00 -> 02:00) being uploaded
thanos[977]: ts=2024-07-16T01:14:25.738032961Z caller=shipper.go:372 level=info msg="upload new block" id=01J2WGZFQBHM1MDE2F60ZZQGJY
prometheus[962]: ts=2024-07-16T03:01:02.537Z caller=compact.go:576 level=info component=tsdb msg="write block" mint=1721088000000 maxt=1721095200000 ulid=01J2WQQ4CJ9J721M98KRXP14DZ duration=1m0.983472513s ooo=false
prometheus[962]: ts=2024-07-16T03:01:02.539Z caller=db.go:1517 level=warn component=tsdb msg="Overlapping blocks found during reloadBlocks" detail="[mint: 1721088000000, maxt: 1721095200000, range: 2h0m0s, blocks: 2]: <ulid: 01J2WQQ4CJ9J721M98KRXP14DZ, mint: 1721088000000, maxt: 1721095200000, range: 2h0m0s>, <ulid: 01J2WGZFQBHM1MDE2F60ZZQGJY, mint: 1721088000000, maxt: 1721095200000, range: 2h0m0s>"
prometheus[962]: ts=2024-07-16T03:01:11.475Z caller=head.go:1345 level=info component=tsdb msg="Head GC completed" caller=truncateMemory duration=8.936572029s
prometheus[962]: ts=2024-07-16T03:01:11.769Z caller=checkpoint.go:101 level=info component=tsdb msg="Creating checkpoint" from_segment=6060 to_segment=6094 mint=1721095200000

# Regular block (01J2WQQ4CJ9J721M98KRXP14DZ) for the current span (00:00 -> 02:00) being uploaded, making thanos compact conflict
thanos[977]: ts=2024-07-16T03:01:12.456523724Z caller=shipper.go:372 level=info msg="upload new block" id=01J2WQQ4CJ9J721M98KRXP14DZ
prometheus[962]: ts=2024-07-16T03:01:49.198Z caller=head.go:1307 level=info component=tsdb msg="WAL checkpoint complete" first=6060 last=6094 duration=37.429718752s
prometheus[962]: ts=2024-07-16T03:02:01.797Z caller=db.go:1317 level=info component=tsdb msg="compact ooo head resulted in no blocks" duration=4.428µs
prometheus[962]: ts=2024-07-16T03:02:01.797Z caller=db.go:1517 level=warn component=tsdb msg="Overlapping blocks found during reloadBlocks" detail="[mint: 1721088000000, maxt: 1721095200000, range: 2h0m0s, blocks: 2]: <ulid: 01J2WQQ4CJ9J721M98KRXP14DZ, mint: 1721088000000, maxt: 1721095200000, range: 2h0m0s>, <ulid: 01J2WGZFQBHM1MDE2F60ZZQGJY, mint: 1721088000000, maxt: 1721095200000, range: 2h0m0s>"
prometheus[962]: ts=2024-07-16T03:02:01.806Z caller=compact.go:762 level=info component=tsdb msg="Found overlapping blocks during compaction" ulid=01J2WQTST65A7J56MY6RQBRQHF

# Prometheus solves conflict between the OOO block (01J2WGZFQBHM1MDE2F60ZZQGJY) and regular block (01J2WQQ4CJ9J721M98KRXP14DZ), obsoleting both
prometheus[962]: ts=2024-07-16T03:02:39.409Z caller=compact.go:514 level=info component=tsdb msg="compact blocks" count=2 mint=1721088000000 maxt=1721095200000 ulid=01J2WQTST65A7J56MY6RQBRQHF sources="[01J2WGZFQBHM1MDE2F60ZZQGJY 01J2WQQ4CJ9J721M98KRXP14DZ]" duration=37.611596935s
prometheus[962]: ts=2024-07-16T03:02:39.414Z caller=db.go:1680 level=info component=tsdb msg="Deleting obsolete block" block=01J2WGZFQBHM1MDE2F60ZZQGJY
prometheus[962]: ts=2024-07-16T03:02:39.541Z caller=db.go:1680 level=info component=tsdb msg="Deleting obsolete block" block=01J2WQQ4CJ9J721M98KRXP14DZ
prometheus[962]: ts=2024-07-16T05:01:06.041Z caller=compact.go:576 level=info component=tsdb msg="write block" mint=1721095200011 maxt=1721102400000 ulid=01J2WYJVFCCKHPKEZYR3K10EMY duration=1m4.65230713s ooo=false
prometheus[962]: ts=2024-07-16T05:01:06.101Z caller=db.go:1680 level=info component=tsdb msg="Deleting obsolete block" block=01J2TC64YWY3K7RSNMGCDCHARB

# Regular block for 02:00 -> 04:00 being uploaded
thanos[977]: ts=2024-07-16T05:01:12.467446639Z caller=shipper.go:372 level=info msg="upload new block" id=01J2WYJVFCCKHPKEZYR3K10EMY
prometheus[962]: ts=2024-07-16T05:01:15.663Z caller=head.go:1345 level=info component=tsdb msg="Head GC completed" caller=truncateMemory duration=9.562823399s
prometheus[962]: ts=2024-07-16T05:01:15.958Z caller=checkpoint.go:101 level=info component=tsdb msg="Creating checkpoint" from_segment=6095 to_segment=6129 mint=1721102400000
prometheus[962]: ts=2024-07-16T05:01:52.723Z caller=head.go:1307 level=info component=tsdb msg="WAL checkpoint complete" first=6095 last=6129 duration=36.765643922s
prometheus[962]: ts=2024-07-16T05:02:06.221Z caller=db.go:1317 level=info component=tsdb msg="compact ooo head resulted in no blocks" duration=4.956µs
prometheus[962]: ts=2024-07-16T07:01:01.626Z caller=compact.go:576 level=info component=tsdb msg="write block" mint=1721102400003 maxt=1721109600000 ulid=01J2X5EJJ78HWJXNDSC79RVFCM duration=1m0.402891801s ooo=false
prometheus[962]: ts=2024-07-16T07:01:01.692Z caller=db.go:1680 level=info component=tsdb msg="Deleting obsolete block" block=01J2TK1W98NSE8JDQZTTT5D3VM

# (other machine) Thanos compact getting upset about the overlap between the OOO and regular block
thanos[25601]: ts=2024-07-16T06:48:46.175630874Z caller=intrumentation.go:67 level=warn msg="changing probe status" status=not-ready reason="critical error detected: compaction: group 0@18096438094995422812: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1721088000000, maxt: 1721095200000, range: 2h0m0s, blocks: 2]: <ulid: 01J2WGZFQBHM1MDE2F60ZZQGJY, mint: 1721088000000, maxt: 1721095200000, range: 2h0m0s>, <ulid: 01J2WQQ4CJ9J721M98KRXP14DZ, mint: 1721088000000, maxt: 1721095200000, range: 2h0m0s>"

Other things

Relevant Prometheus config:

storage:
  tsdb:
    out_of_order_time_window: 1h

Prometheus command line:

prometheus \
    --config.file=... \
    --storage.tsdb.path=... \
    --storage.tsdb.min-block-duration=2h \
    --storage.tsdb.max-block-duration=2h \
    --storage.tsdb.retention.time=1d \
    --web.console.templates=... \
    --web.console.libraries=... \
    --web.external-url=... \
    --web.enable-remote-write-receiver \
    --web.enable-lifecycle \
    --log.level=debug

Sidecar command line:

thanos sidecar \
    --tsdb.path ... \
    --objstore.config-file ... \
    --prometheus.url ... \
    --http-address ... \
    --grpc-address ... \
    --grpc-server-tls-cert ... \
    --grpc-server-tls-key ...

Compact command line:

thanos compact \
    --data-dir ... \
    --objstore.config-file ... \
    --http-address ... \
    --wait \
    --retention.resolution-raw=30d \
    --retention.resolution-5m=180d \
    --retention.resolution-1h=365d \
    --delete-delay=30m \
    --no-debug.halt-on-error \
    --block-files-concurrency=2
kennylevinsen commented 1 month ago

Relates to:

On the Thanos Receive side, the solution has been to disable all local compaction and instead require Thanos Compact to have vertical compaction enabled. On the Thanos Sidecar side, there are suggestions to allow uploading compacted blocks and to enable vertical compaction to resolve the duplication.

None of this is documented though, and vertical compaction on the Thanos compactor side is still experimental, with open bugs related to rate/irate and a big warning in the manual.
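
(For illustration, a minimal sketch of what enabling vertical compaction on the compactor looks like today, assuming the documented hidden flag; paths and the retention flags are elided:)

thanos compact \
    --data-dir ... \
    --objstore.config-file ... \
    --wait \
    --compact.enable-vertical-compaction

With no --deduplication.replica-label set, this merges overlapping blocks using the default one-to-one deduplication; setting replica labels switches to offline deduplication and implicitly enables vertical compaction as well.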

One hacky solution for sidecar would be to enable compacted block upload, but add an option to delay block upload. By delaying past the out-of-order window and the block max time, we would be able to:

  1. Know for sure whether a range has overlaps, by just inspecting whether multiple blocks for the same range are present at the time of upload.
  2. Reject uploading blocks with overlap, instead waiting for the compacted block.

The delay would have to be larger than the block max time plus the out-of-order time window, and smaller than the Prometheus retention time, as the store would lag further behind. If Prometheus crashes, there could still be an old overlap with something in the WAL/WBL. Skipping upload of very old blocks, together with the Thanos Compact skip-overlap option, might be enough for that.
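
(A rough sketch of what that could look like on the sidecar; --shipper.upload-compacted exists today, while the delay flag below is hypothetical and does not exist:)

thanos sidecar \
    --tsdb.path ... \
    --objstore.config-file ... \
    --prometheus.url ... \
    --shipper.upload-compacted \
    --shipper.upload-delay=3h30m   # hypothetical flag: > 2h block range + 1h OOO window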

Another (stupid) solution would be to have a way to make Prometheus always compact, and then only upload compacted blocks.

Of course, the best solution would just be to have thanos compact do the vertical compaction without rate issues...

yeya24 commented 1 month ago

vertical compaction on the thanos compactor side is still experimental with open bugs related to rate/irate and a big warning in the manual

Can you please remind me what's the open issue about rate/irate? I am aware of such issues with downsampling, but nothing about vertical compaction, as Thanos vertical compaction is the same as what Prometheus does.

I think we should mark vertical compaction as non-experimental. Thanks for this; we should update the docs.

I think we can just enable the same configuration in Prometheus, so that Prometheus disables overlapping compaction locally and the compactor will handle it: https://github.com/prometheus/prometheus/issues/13112

kennylevinsen commented 1 month ago

Can you please remind me what's the open issue about rate/irate? I am aware of such issues with downsampling, but nothing about vertical compaction, as Thanos vertical compaction is the same as what Prometheus does.

From the Thanos docs "Vertical Compaction Risks" section: https://github.com/thanos-io/thanos/issues/2890. It's a quite old issue with the occasional ping, making the current situation unclear, but it being left in the documentation does make it seem like end users should still be cautious.

Making vertical compaction non-experimental and recommended, at least for this use case, in conjunction with a new Prometheus flag to mimic the new receiver behavior, sounds good to me.

yeya24 commented 1 month ago

From the Thanos docs "Vertical Compaction Risks" section: https://github.com/thanos-io/thanos/issues/2890. It's a quite old issue with the occasional ping, making the current situation unclear, but it being left in the documentation does make it seem like end users should still be cautious.

Got it. If it is this rate bug, then it is specific to the penalty deduplication mode and it shouldn't happen with the default 1:1 deduplication in vertical compaction.

For OOO blocks handled by the compactor with vertical compaction, as long as penalty deduplication is not enabled (and it shouldn't be, since penalty dedup is for HA setups), I don't see any risks.
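
(For concreteness, the two modes map to compactor flags roughly as follows; the replica label name "replica" is an assumption about the HA setup:)

# Default one-to-one deduplication of overlapping (e.g. OOO) blocks:
thanos compact ... --compact.enable-vertical-compaction

# Penalty deduplication across an HA pair:
thanos compact ... \
    --compact.enable-vertical-compaction \
    --deduplication.func=penalty \
    --deduplication.replica-label=replica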

zoglam commented 1 month ago

Got it. If it is this rate bug, then it is specific to the penalty deduplication mode and it shouldn't happen with the default 1:1 deduplication in vertical compaction.

For OOO blocks handled by the compactor with vertical compaction, as long as penalty deduplication is not enabled (and it shouldn't be, since penalty dedup is for HA setups), I don't see any risks.

Describing the example situation:

An HA pair of Prometheus servers sends data through thanos-sidecar, and the compactor operates in "penalty dedup" mode. Later on, the system expands and remote_write + OOO functionality is added to the Prometheus servers. (According to the documentation, different modes are used for the two scenarios: one-to-one for receivers, penalty for HA pairs.) So we end up with HA pair + remote_write + OOO with penalty dedup.

After deduplication, there is a reduction in the counter metrics that were written via remote_write to the HA pair. In one-to-one deduplication mode, everything falls apart.

  1. What suggestions are there?
  2. Why can't an HA pair with penalty dedup mode plus OOO be supported?

yeya24 commented 1 month ago

Hey @zoglam, sorry for the late reply. I think I got your point now... The merge between regular and OOO blocks should use one-to-one deduplication, while the deduplication between HA pairs should use penalty mode. So in this case, it would require one-to-one deduplication first and then penalty. I think this is not something we support atm.

I think deploying two sets of compactors using different strategies could help. But I don't think we have a way to only compact the blocks matching each target strategy. I will think about it more and see if there is a better way to do it.
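
(A sketch of the "two sets of compactors" idea using the existing --selector.relabel-config to split work by external labels; as noted above, this only splits by block labels, not by which dedup strategy a given block actually needs, and the "cluster"/"ha-pair" label values are assumptions:)

# Compactor 1: blocks from the HA Prometheus pair, penalty dedup
thanos compact ... \
    --compact.enable-vertical-compaction \
    --deduplication.func=penalty \
    --deduplication.replica-label=replica \
    --selector.relabel-config='[{"action": "keep", "source_labels": ["cluster"], "regex": "ha-pair"}]'

# Compactor 2: everything else, default one-to-one dedup
thanos compact ... \
    --compact.enable-vertical-compaction \
    --selector.relabel-config='[{"action": "drop", "source_labels": ["cluster"], "regex": "ha-pair"}]'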

yeya24 commented 1 month ago

Created issue https://github.com/prometheus-operator/prometheus-operator/issues/6829 on the operator to track the idea described in https://github.com/thanos-io/thanos/issues/7551#issuecomment-2258784742. The flag has already been added to Prometheus.