thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

compactor: no downsampling, why? How to enable/activate/implement downsampling? #6866

Kiara0107 closed this issue 10 months ago

Kiara0107 commented 11 months ago

Thanos, Prometheus and Golang version used:

Object Storage Provider: S3 - Wasabi

What happened: no downsampling, but also no errors. The metric thanos_compact_todo_downsample_blocks stays at 0 (a flatliner), and bucket_store_block_series always shows the tag block.resolution = 0.

What you expected to happen: since there is data for the last 10 months, I would expect downsampled blocks.

How to reproduce it (as minimally and precisely as possible): Prometheus docker run:

docker run -d -p 9090:9090 \
      -v /etc/prometheus:/prometheus/ \
      -u root \
      --restart unless-stopped \
      --network prometheusnet \
      --name prometheus prom/prometheus:latest \
      --config.file=/prometheus/prometheus.yml \
      --storage.tsdb.path=/prometheus \
      --storage.tsdb.min-block-duration=2h \
      --storage.tsdb.max-block-duration=2h \
      --storage.tsdb.retention.time=3d \
      --web.enable-lifecycle \
      --web.enable-admin-api

Thanos compact:

docker run -d -p 19191:19191 \
      -v /etc/thanos-store/storage-bucket.yml:/storage.yml \
      -v /etc/thanos-store/tracing-config-compact.yml:/tracing-config.yml \
      -u root  \
      --restart unless-stopped \
      --network prometheusnet \
      --name thanos-compact thanosio/thanos:v0.32.5 compact \
      --data-dir /var/thanos/compact \
      --objstore.config-file storage.yml \
      --http-address 0.0.0.0:19191 \
      --wait \
      --retention.resolution-raw=30d  \
      --retention.resolution-5m=90d \
      --retention.resolution-1h=730d \
      --tracing.config-file=/tracing-config.yml

Full logs to relevant components: N/A, there are no errors or relevant mentions in the logs.

Anything else we need to know: I've read the docs at https://thanos.io/tip/components/compact.md/#downsampling, https://thanos.io/tip/operating/compactor-backlog.md/ and https://thanos.io/tip/components/sidecar.md/, but I'm still clueless about how to enable downsampling.

fpetkovski commented 11 months ago

Can you post a screenshot of the block UI from the compactor? You can access it on the HTTP port, which you've set to 19191.

Kiara0107 commented 11 months ago

Yes, of course. Is this what you expected? [screenshot]

fpetkovski commented 11 months ago

So downsampling only happens for blocks that are 2 days or 14 days in duration. Your bucket seems to have many 2-hour blocks, which is probably because no compaction is taking place. Do you see any logs related to compaction?
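
One way to double-check the block layout is the bucket inspect tool, run against the same bucket config the compact container already mounts (just a sketch, assuming the /storage.yml mount from the docker run above; output columns vary between Thanos versions):

# run from the docker host; prints a table of blocks with their time range and resolution
docker exec thanos-compact thanos tools bucket inspect --objstore.config-file=/storage.yml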

Kiara0107 commented 11 months ago

Only at startup: "start of compactions"

ts=2023-11-02T09:06:31.990755613Z caller=factory.go:43 level=info msg="loading tracing configuration"
ts=2023-11-02T09:06:31.999074869Z caller=factory.go:53 level=info msg="loading bucket configuration"
ts=2023-11-02T09:06:32.002751138Z caller=compact.go:393 level=info msg="retention policy of raw samples is enabled" duration=720h0m0s
ts=2023-11-02T09:06:32.003106545Z caller=compact.go:400 level=info msg="retention policy of 5 min aggregated samples is enabled" duration=2160h0m0s
ts=2023-11-02T09:06:32.003350049Z caller=compact.go:403 level=info msg="retention policy of 1 hour aggregated samples is enabled" duration=17520h0m0s
ts=2023-11-02T09:06:32.00978147Z caller=compact.go:643 level=info msg="starting compact node"
ts=2023-11-02T09:06:32.010126677Z caller=intrumentation.go:56 level=info msg="changing probe status" status=ready
ts=2023-11-02T09:06:32.01245292Z caller=intrumentation.go:75 level=info msg="changing probe status" status=healthy
ts=2023-11-02T09:06:32.012810827Z caller=http.go:73 level=info service=http/server component=compact msg="listening for requests and metrics" address=0.0.0.0:19191
ts=2023-11-02T09:06:32.015919185Z caller=tls_config.go:274 level=info service=http/server component=compact msg="Listening on" address=[::]:19191
ts=2023-11-02T09:06:32.015973286Z caller=tls_config.go:277 level=info service=http/server component=compact msg="TLS is disabled." http2=false address=[::]:19191
ts=2023-11-02T09:06:32.016096889Z caller=compact.go:1414 level=info msg="start sync of metas"
ts=2023-11-02T09:06:52.820841212Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=20.810256927s duration_ms=20810 cached=7100 returned=7100 partial=0
ts=2023-11-02T09:07:01.338317648Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=29.322172459s duration_ms=29322 cached=7100 returned=7098 partial=0
ts=2023-11-02T09:07:01.341844114Z caller=compact.go:1419 level=info msg="start of GC"
ts=2023-11-02T09:07:01.851740062Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=9.028020496s duration_ms=9028 cached=7100 returned=7100 partial=0
ts=2023-11-02T09:07:02.851016973Z caller=compact.go:1442 level=info msg="start of compactions"

After that, it only does cleaning and deletions?

ts=2023-11-02T10:21:42.402026159Z caller=blocks_cleaner.go:44 level=info msg="started cleaning of blocks marked for deletion"
ts=2023-11-02T10:21:42.402199762Z caller=blocks_cleaner.go:58 level=info msg="cleaning of blocks marked for deletion done"
ts=2023-11-02T10:21:52.887361133Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=10.486360492s duration_ms=10486 cached=7100 returned=7100 partial=0
ts=2023-11-02T10:21:58.83169436Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=6.007858343s duration_ms=6007 cached=7100 returned=7100 partial=0
ts=2023-11-02T10:22:59.439970764Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=6.61612035s duration_ms=6616 cached=7100 returned=7100 partial=0
ts=2023-11-02T10:23:57.770258898Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=4.946454995s duration_ms=4946 cached=7100 returned=7100 partial=0
ts=2023-11-02T10:24:57.216771043Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=4.392861564s duration_ms=4392 cached=7100 returned=7100 partial=0
ts=2023-11-02T10:25:59.562012347Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=6.738154623s duration_ms=6738 cached=7100 returned=7100 partial=0
ts=2023-11-02T10:26:42.519282273Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=10.502378241s duration_ms=10502 cached=7100 returned=7100 partial=0
ts=2023-11-02T10:26:42.522295526Z caller=clean.go:34 level=info msg="started cleaning of aborted partial uploads"
ts=2023-11-02T10:26:42.522346827Z caller=clean.go:61 level=info msg="cleaning of aborted partial uploads done"
ts=2023-11-02T10:26:42.522375327Z caller=blocks_cleaner.go:44 level=info msg="started cleaning of blocks marked for deletion"
ts=2023-11-02T10:26:42.522398727Z caller=blocks_cleaner.go:58 level=info msg="cleaning of blocks marked for deletion done"
ts=2023-11-02T10:26:52.796659158Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=10.27421753s duration_ms=10274 cached=7100 returned=7100 partial=0
ts=2023-11-02T10:26:58.85392721Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=6.029460663s duration_ms=6029 cached=7100 returned=7100 partial=0
ts=2023-11-02T10:27:57.127456714Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=4.302953196s duration_ms=4302 cached=7100 returned=7100 partial=0
ts=2023-11-02T10:29:00.597771191Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=7.773920663s duration_ms=7773 cached=7100 returned=7100 partial=0
ts=2023-11-02T10:29:57.047962512Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=4.224017514s duration_ms=4224 cached=7100 returned=7100 partial=0
ts=2023-11-02T10:30:57.357907077Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=4.534037372s duration_ms=4534 cached=7100 returned=7100 partial=0

The to-do compactions metric is doing something but stays very high, so maybe it's just busy? (At least, that's what I thought, so I increased the concurrency from 1 to 6, but I don't see much happening.) [screenshot]

fpetkovski commented 11 months ago

I am not sure if any compaction is taking place. I think the pending values go down due to retention, not because compactions are completed.

Kiara0107 commented 11 months ago

Any idea on how to check?

PabloPie commented 11 months ago

We seem to be having a similar issue since we upgraded to v0.32.X from v0.31. No error logs at all but we can see that the metric thanos_compact_group_compactions_total has been 0 since the upgrade, while thanos_compact_todo_compactions keeps increasing.

yeya24 commented 11 months ago

We have a guide on how to troubleshoot this problem https://thanos.io/tip/operating/compactor-backlog.md/#troubleshoot-compactor-backlog

Updated: nvm you already mentioned you read the doc. Downsampling is enabled by default. You can scale up compactors to catch up.

@Kiara0107 Can you check the thanos_compact_todo_compactions metric first? This metric needs to be 0; downsampling only happens after all compactions finish.

One possible reason thanos_compact_todo_downsample_blocks is 0 is that your compaction is slow and your compacted blocks are not yet large enough to be downsampled.
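
As a quick sketch, those counters can be read straight off the compactor's metrics endpoint (port 19191 comes from the docker run in the issue description; metric names as used in this thread):

# thanos_compact_todo_compactions should reach 0 before downsampling work appears;
# thanos_compact_halted switching to 1 means the compactor stopped on an error
curl -s http://localhost:19191/metrics | grep -E \
  'thanos_compact_todo_compactions|thanos_compact_todo_downsample_blocks|thanos_compact_group_compactions_total|thanos_compact_halted'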

Kiara0107 commented 11 months ago

Yes, it looks like no compaction is being done. I increased the concurrency to 20 and see the memory usage of the container increasing a lot, but other than that, no changes. [screenshot]

[screenshot]

fpetkovski commented 11 months ago

Could you post a graph of bucket operations?

Also, this could be a bug introduced in 0.32, as noted in another issue. Could you try 0.31 to see if it makes a difference?

Kiara0107 commented 11 months ago

[screenshot] Will revert to 0.31.

Kiara0107 commented 11 months ago

I think something is corrupted now. I'm seeing the "critical error detected; halting" message in the log (way too long to paste completely here):

<ulid: 01GQXYB0R6P8S474P4SV94YBFD, mint: 1674957600172, maxt: 1674964800000, range: 1h59m59s>, <ulid: 01GQXYB0JHSS2M71HRS53A8BW4, mint: 1674957600281, maxt: 1674964800000, range: 1h59m59s>\n[mint: 1680350400377, maxt: 1680357600000, range: 1h59m59s, blocks: 2]: <ulid: 01GWYNA6XTHQFFMNK2FC3AVG6Z, mint: 1680350400326, maxt: 1680357600000, range: 1h59m59s>, <ulid: 01GWYNA7DNA31Y15AS49DD3VD0, mint: 1680350400377, maxt: 1680357600000, range: 1h59m59s>\n[mint: 1680926400378, maxt: 1680933600000, range: 1h59m59s, blocks: 2]: <ulid: 01GXFTMDFRTAVKDV71DNWY02SC, mint: 1680926400152, maxt: 1680933600000, range: 1h59m59s>, <ulid: 01GXFTMBB6HJPERPCF46HEMGBB, mint: 1680926400378, maxt: 1680933600000, range: 1h59m59s>\n[mint: 1681322400368, maxt: 1681329600000, range: 1h59m59s, blocks: 2]: <ulid: 01GXVM9D42W1H80QBRER6BNGX8, mint: 1681322400076, maxt: 1681329600000, range: 1h59m59s>, <ulid: 01GXVM99TS33NJ0QXH6PN7H9AJ, mint: 1681322400368, maxt: 1681329600000, range: 1h59m59s>\n[mint: 1684274400160, maxt: 1684281600000, range: 1h59m59s, blocks: 2]: <ulid: 01H0KKH8V94HE6VWA17WMHNJHH, mint: 1684274400088, maxt: 1684281600000, range: 1h59m59s>, <ulid: 01H0KKHDZRSEBFNW3GNE6H1022, mint: 1684274400160, maxt: 1684281600000, range: 1h59m59s>\n[mint: 1684987200159, maxt: 1684994400000, range: 1h59m59s, blocks: 2]: <ulid: 01H18VA6JVM5A4NE2A6J4KEC04, mint: 1684987200088, maxt: 1684994400000, range: 1h59m59s>, <ulid: 01H18VABQGEQTQ2B8WCTP2KQ5Z, mint: 1684987200159, maxt: 1684994400000, range: 1h59m59s>\n[mint: 1671890400530, maxt: 1671897600000, range: 1h59m59s, blocks: 2]: <ulid: 01GN2H7FQ92R900TNQK0GF7959, mint: 1671890400085, maxt: 1671897600000, range: 1h59m59s>, <ulid: 01GN2H7HWPS8CXJBNPS5CTPGQR, mint: 1671890400530, maxt: 1671897600000, range: 1h59m59s>\n[mint: 1679169600377, maxt: 1679176800000, range: 1h59m59s, blocks: 2]: <ulid: 01GVVF71XSV8HE90NDJXW48P91, mint: 1679169600326, maxt: 1679176800000, range: 1h59m59s>, <ulid: 01GVVF72DK4ZEBNY1AD8BEF2HS, mint: 1679169600377, maxt: 1679176800000, range: 1h59m59s>\n[mint: 1686038400159, maxt: 1686045600000, range: 1h59m59s, blocks: 2]: <ulid: 01H285T926YEVYDG6XGHY73038, mint: 1686038400088, maxt: 1686045600000, range: 1h59m59s>, <ulid: 01H285TE7PBC2AQK9S0MG22RR6, mint: 1686038400159, maxt: 1686045600000, range: 1h59m59s>\n[mint: 1677088800344, maxt: 1677096000000, range: 1h59m59s, blocks: 2]: <ulid: 01GSXET2F9MMB3TME5BDCRY0V0, mint: 1677088800118, maxt: 1677096000000, range: 1h59m59s>, <ulid: 01GSXET2FG8KY4EW5XBYHXGKE3, mint: 1677088800344, maxt: 1677096000000, range: 1h59m59s>\n[mint: 1678233600392, maxt: 1678240800000, range: 1h59m59s, blocks: 2]: <ulid: 01GTZJJM4F827051J0MDGGTY3W, mint: 1678233600283, maxt: 1678240800000, range: 1h59m59s>, <ulid: 01GTZJJKAVF651GMNMBRMC26NN, mint: 1678233600392, maxt: 1678240800000, range: 1h59m59s>\n[mint: 1684706400160, maxt: 1684713600000, range: 1h59m59s, blocks: 2]: <ulid: 01H10FGVTAY6WB9RDHD164E4HV, mint: 1684706400088, maxt: 1684713600000, range: 1h59m59s>, <ulid: 01H10FH0Z6RWZ64VM4V1F60TJA, mint: 1684706400160, maxt: 1684713600000, range: 1h59m59s>\n[mint: 1682524800160, maxt: 1682532000000, range: 1h59m59s, blocks: 2]: <ulid: 01GYZEZQ2CQ0D9FQTNWK2ASREM, mint: 1682524800075, maxt: 1682532000000, range: 1h59m59s>, <ulid: 01GYZEZW6Y79AMW7GSTKR1HNS3, mint: 1682524800160, maxt: 1682532000000, range: 1h59m59s>\n[mint: 1684123200160, maxt: 1684130400000, range: 1h59m59s, blocks: 2]: <ulid: 01H0F3B0J776N5QQRVBERAQ6Y2, mint: 1684123200088, maxt: 1684130400000, range: 1h59m59s>, <ulid: 01H0F3B5PSG4Z77HARSGYA6Y82, mint: 
1684123200160, maxt: 1684130400000, range: 1h59m59s>\n[mint: 1686254400159, maxt: 1686261600000, range: 1h59m59s, blocks: 2]: <ulid: 01H2EKT2J8VY2DW7T65ZVJX46R, mint: 1686254400088, maxt: 1686261600000, range: 1h59m59s>, <ulid: 01H2EKSZWGTEP2X1F1YQQB5ARX, mint: 1686254400159, maxt: 1686261600000, range: 1h59m59s>\n[mint: 1680408000377, maxt: 1680415200000, range: 1h59m59s, blocks: 2]: <ulid: 01GX0C80XS26E75QAT7FJG3WCP, mint: 1680408000326, maxt: 1680415200000, range: 1h59m59s>, <ulid: 01GX0C81DWE3HFHVS7P9NK7BF2, mint: 1680408000377, maxt: 1680415200000, range: 1h59m59s>\n[mint: 1680480000377, maxt: 1680487200000, range: 1h59m59s, blocks: 2]: <ulid: 01GX2GX9DT4SJ471NJ96M4RWG6, mint: 1680480000326, maxt: 1680487200000, range: 1h59m59s>, <ulid: 01GX2GX9XHFEEY7TD5Z9KH6T99, mint: 1680480000377, maxt: 1680487200000, range: 1h59m59s>\n[mint: 1686830400160, maxt: 1686837600000, range: 1h59m59s, blocks: 2]: <ulid: 01H2ZS49GFKFGB7GA0KM8KSGYA, mint: 1686830400075, maxt: 1686837600000, range: 1h59m59s>, <ulid: 01H2ZS4BBH6TXCZ8N0ZQHBBV21, mint: 1686830400160, maxt: 1686837600000, range: 1h59m59s>\n[mint: 1675353600196, maxt: 1675360800000, range: 1h59m59s, blocks: 2]: <ulid: 01GR9QZZ1E96646TQ477ENXANZ, mint: 1675353600035, maxt: 1675360800000, range: 1h59m59s>, <ulid: 01GR9QZZX40MP82DAW4QMPF235, mint: 1675353600196, maxt: 1675360800000, range: 1h59m59s>\n[mint: 1680127200377, maxt: 1680134400000, range: 1h59m59s, blocks: 2]: <ulid: 01GWR0EP5VG4YA7SBPSTXVCZ0C, mint: 1680127200326, maxt: 1680134400000, range: 1h59m59s>, <ulid: 01GWR0EPPAXB0CSXS545HHBZ5Y, mint: 1680127200377, maxt: 1680134400000, range: 1h59m59s>\n[mint: 1676102400344, maxt: 1676109600000, range: 1h59m59s, blocks: 2]: <ulid: 01GS023H6Y0YERN94QQCET10E9, mint: 1676102400118, maxt: 1676109600000, range: 1h59m59s>, <ulid: 01GS023H7G57Q5BKKGKKSYD3X4, mint: 1676102400344, maxt: 1676109600000, range: 1h59m59s>\n[mint: 1671256800529, maxt: 1671264000000, range: 1h59m59s, blocks: 2]: <ulid: 01GMFMZHWPKA1HXVY58PF203TB, mint: 1671256800121, maxt: 1671264000000, range: 1h59m59s>, <ulid: 01GMFMZKSY2Y97PA39TAM1G462, mint: 1671256800529, maxt: 1671264000000, range: 1h59m59s>\n[mint: 1674583200281, maxt: 1674590400000, range: 1h59m59s, blocks: 2]: <ulid: 01GQJS97R6BZ90FMVXAB69KEJ4, mint: 1674583200172, maxt: 1674590400000, range: 1h59m59s>, <ulid: 01GQJS97JHKQHSN0TDZCX8HD27, mint: 1674583200281, maxt: 1674590400000, range: 1h59m59s>"

PabloPie commented 11 months ago

Similar log for me actually: populate block: chunk iter: cannot populate chunk 8 from block 01HA60TQ55Z0DRWHW5ZAD51AYB: segment index 0 out of range","level":"error","msg":"critical error detected; halting"

I guess a problem with a single block is halting all compaction?

Kiara0107 commented 11 months ago

Mine is slighty different, starts with: level=error ts=2023-11-03T10:06:38.234649615Z caller=compact.go:487 msg="critical error detected; halting" err="compaction: 4 errors: group 0@4841334735344211667: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1687615201155, maxt: 1687622400000, range: 1h59m58s, blocks: 2]: <ulid: 01H3Q5JEPZENCHNYBTXM3VFN3K, mint: 1687615200022, maxt: 1687622400000, range: 1h59m59s>, <ulid: 01H3Q5JY099D8Q5KT1150AM8WP, mint: 1687615201155, maxt: 1687622400000, range: 1h59m58s>\n[mint: 1688220000064, maxt: 1688227200000, range: 1h59m59s, blocks: 2]: <ulid: 01H496BFGRS2JKTWPAA9R0X1P9, mint: 1688220000049, maxt: 1688227200000, range: 1h59m59s>, <ulid: 01H496BFKGTRZKNAJGY6C67236, mint: 1688220000064, maxt: 1688227200000, range: 1h59m59s>\n[mint: 1688479200064, maxt: 1688486400000, r

Edit: There are over 1500 blocks mentioned. I tried to log on to the container and use the bucket tools, but without success:

user@host:~$ docker exec -it thanos-compact /bin/sh
/ # thanos tools bucket verify --repair
level=error ts=2023-11-03T10:48:32.108426793Z caller=main.go:135 err="flag objstore.config-file or objstore.config is required for running this command and content cannot be empty.\ngithub.com/efficientgo/tools/extkingpin.(*PathOrContent).Content\n\t/go/pkg/mod/github.com/efficientgo/tools/extkingpin@v0.0.0-20220817170617-6c25e3b627dd/pathorcontent.go:87\nmain.registerBucketVerify.func1\n\t/app/cmd/thanos/tools_bucket.go:297\nmain.main\n\t/app/cmd/thanos/main.go:133\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\npreparing tools bucket verify command failed\nmain.main\n\t/app/cmd/thanos/main.go:135\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"
/ #
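
For what it's worth, the flag error above should go away once the command is pointed at the bucket config that the compact container already mounts at /storage.yml (a sketch only; this fixes the missing-flag error, not the overlaps themselves):

/ # thanos tools bucket verify --objstore.config-file=/storage.yml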

Any suggestions on how to fix this 'overlap' error? Manually removing blocks from S3 doesn't feel like the way to go :S

PabloPie commented 11 months ago

It seems my compactor is now making progress. Apparently, the compactor had been running for weeks doing nothing after it encountered an error with a block (our Thanos compactor dashboard wasn't showing the most critical metric: thanos_compact_halted). I also enabled the flag suggested here, which only made things worse, because it merely skipped logging the error, not the block.

The only solution I could think of is to manually mark the failing blocks for non-compaction with thanos tools bucket mark --id=01HA0N5HY0AWK2K75E7Z121JC8 --marker=no-compact-mark.json. Not sure if that can help you, @Kiara0107.
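
Spelled out with the bucket config, that would look roughly like this (a sketch; the block ID is the example above, the --details text is a free-form note, and exact flag requirements can differ between Thanos versions):

thanos tools bucket mark \
  --objstore.config-file=/storage.yml \
  --id=01HA0N5HY0AWK2K75E7Z121JC8 \
  --marker=no-compact-mark.json \
  --details="block halts compaction, skip it"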

fpetkovski commented 11 months ago

@Kiara0107 do you have replicated blocks, e.g. coming from prometheus pairs?

Kiara0107 commented 11 months ago

@fpetkovski yes I have. We have one Prometheus pair running

global:
  scrape_interval: 30s
  external_labels:
    prometheus: 'COLO06'
    prometheus_replica: 'observer1'

and

global:
  scrape_interval: 30s
  external_labels:
    prometheus: 'COLO06'
    prometheus_replica: 'observer2'

fpetkovski commented 11 months ago

In this case you should use prometheus_replica as the replica label on the compactor and enable vertical compaction.
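
Roughly, that means adding the deduplication flags to the thanos compact command from the issue description (a sketch for v0.32.x; as I read the compact docs, setting --deduplication.replica-label implies vertical compaction, and an empty --deduplication.func keeps the default 1:1 sample deduplication):

      --compact.enable-vertical-compaction \
      --deduplication.replica-label="prometheus_replica" \
      --deduplication.func="" \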

Kiara0107 commented 11 months ago

My hero. It looks like that is indeed helping; the thanos-compact log shows:

ts=2023-11-03T12:42:49.103223767Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=11.914014578s duration_ms=11914 cached=7128 returned=7128 partial=0
ts=2023-11-03T12:43:02.408794746Z caller=compact.go:1221 level=info group="0@{cluster=\"OTA\", environment=\"OTA\", replica=\"A\"}" groupKey=0@4841334735344211667 msg="uploaded block" result_block=01HEAK7YP9YAG8JQTTX56ZF85N duration=7.97653548s duration_ms=7976
ts=2023-11-03T12:43:02.424905239Z caller=compact.go:1256 level=info group="0@{cluster=\"OTA\", environment=\"OTA\", replica=\"A\"}" groupKey=0@4841334735344211667 msg="marking compacted block for deletion" old_block=01H3HK1PJBE1PDZ8DEDZ35ENFE
ts=2023-11-03T12:43:02.469492251Z caller=block.go:203 level=info group="0@{cluster=\"OTA\", environment=\"OTA\", replica=\"A\"}" groupKey=0@4841334735344211667 msg="block has been marked for deletion" block=01H3HK1PJBE1PDZ8DEDZ35ENFE
ts=2023-11-03T12:43:02.503083162Z caller=compact.go:1256 level=info group="0@{cluster=\"OTA\", environment=\"OTA\", replica=\"A\"}" groupKey=0@4841334735344211667 msg="marking compacted block for deletion" old_block=01H3HK1ZVE5QF4A3VGZHMT17KD
ts=2023-11-03T12:43:02.633432335Z caller=block.go:203 level=info group="0@{cluster=\"OTA\", environment=\"OTA\", replica=\"A\"}" groupKey=0@4841334735344211667 msg="block has been marked for deletion" block=01H3HK1ZVE5QF4A3VGZHMT17KD
ts=2023-11-03T12:43:02.633870842Z caller=compact.go:1236 level=info group="0@{cluster=\"OTA\", environment=\"OTA\", replica=\"A\"}" groupKey=0@4841334735344211667 msg="running post compaction callback" result_block=01HEAK7YP9YAG8JQTTX56ZF85N
ts=2023-11-03T12:43:02.634194448Z caller=compact.go:1240 level=info group="0@{cluster=\"OTA\", environment=\"OTA\", replica=\"A\"}" groupKey=0@4841334735344211667 msg="finished running post compaction callback" result_block=01HEAK7YP9YAG8JQTTX56ZF85N
ts=2023-11-03T12:43:02.634500054Z caller=compact.go:1242 level=info group="0@{cluster=\"OTA\", environment=\"OTA\", replica=\"A\"}" groupKey=0@4841334735344211667 msg="finished compacting blocks" result_block=01HEAK7YP9YAG8JQTTX56ZF85N source_blocks="[/var/thanos/compact/compact/0@4841334735344211667/01H3HK1PJBE1PDZ8DEDZ35ENFE /var/thanos/compact/compact/0@4841334735344211667/01H3HK1ZVE5QF4A3VGZHMT17KD]" duration=2m14.217757014s duration_ms=134217
ts=2023-11-03T12:43:45.72841021Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=8.526887235s duration_ms=8526 cached=7129 returned=7129 partial=0

Which is for sure different than before. I will leave it running now and see if compaction continues and downsampling gets activated. Thnx for now :)

FYI, I've added these two flags to the docker run command:

      --deduplication.func="" \
      --deduplication.replica-label="prometheus_replica" \

yeya24 commented 10 months ago

Since it is a configuration issue, I will close this one.

We should update the compactor backlog troubleshooting doc to mention checking for a halted compactor first.

https://github.com/thanos-io/thanos/pull/6906

jakuboskera commented 6 months ago

In this case you should use prometheus_replica as the replica label on the compactor and enable vertical compaction.

So every time I have two or more Prometheus replicas scraping the same targets (same cluster), distinguished by a prometheus_replica label with a different value per replica, should I add these two params to the compactor?

- --deduplication.func=
- --deduplication.replica-label=prometheus_replica

Or what is the best and recommended way to set up the compactor when I have HA Prometheus in the same cluster?

Thanks