thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0
13.09k stars 2.1k forks

Compactor only gets meta info, does not compact, and does not downsample #5307

Closed ChaoHsupin closed 2 years ago

ChaoHsupin commented 2 years ago

Thanos, Prometheus and Golang version used: Thanos v0.21.1

Object Storage Provider: cos

What happened: The compactor only syncs block metadata; it never compacts or downsamples. k8s deploy args:

containers:
    - name: thanos-compactor
      image: registry.qunhequnhe.com/monitor/thanos:v0.21.1
      args:
        - compact
        - '--wait'
        - '--log.level=info'
        - '--retention.resolution-raw=30d'
        - '--retention.resolution-5m=60d'
        - '--retention.resolution-1h=0d'
        - '--compact.concurrency=3'
        - '--data-dir=/var/thanos/compactor/thanos-compactor-01'
        - '--objstore.config=$(OBJSTORE_CONFIG)'
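
As an aside, the day-based retention flags above correspond to the hour-based durations the compactor prints at startup (30d appears as 720h0m0s in the log below). A rough sketch of that conversion, as a hypothetical helper and not actual Thanos code:

```python
def retention_to_go_duration(flag: str) -> str:
    """Render a day-based retention flag (e.g. "30d") the way Go's
    time.Duration prints whole hours, e.g. "720h0m0s"."""
    if not flag.endswith("d"):
        raise ValueError(f"expected a day-based flag like '30d', got {flag!r}")
    hours = int(flag[:-1]) * 24
    return f"{hours}h0m0s"

print(retention_to_go_duration("30d"))  # 720h0m0s, as in the raw-retention log line
print(retention_to_go_duration("60d"))  # 1440h0m0s, as in the 5m-retention log line
```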

In the container logs I can't see any lines about compaction or downsampling starting:

level=info ts=2022-04-27T17:02:44.949607703Z caller=factory.go:46 msg="loading bucket configuration"
level=info ts=2022-04-27T17:02:44.964524143Z caller=compact.go:363 msg="retention policy of raw samples is enabled" duration=720h0m0s
level=info ts=2022-04-27T17:02:44.964635529Z caller=compact.go:366 msg="retention policy of 5 min aggregated samples is enabled" duration=1440h0m0s
level=info ts=2022-04-27T17:02:44.966728779Z caller=compact.go:537 msg="starting compact node"
level=info ts=2022-04-27T17:02:44.966762225Z caller=intrumentation.go:48 msg="changing probe status" status=ready
level=info ts=2022-04-27T17:02:44.96689767Z caller=intrumentation.go:60 msg="changing probe status" status=healthy
level=info ts=2022-04-27T17:02:44.966985779Z caller=http.go:63 service=http/server component=compact msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2022-04-27T17:02:44.967136134Z caller=compact.go:942 msg="start sync of metas"
level=info ts=2022-04-27T17:02:44.967450242Z caller=tls_config.go:191 service=http/server component=compact msg="TLS is disabled." http2=false
level=info ts=2022-04-27T17:03:13.861835684Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=28.89419327s cached=15109 returned=15109 partial=2
level=info ts=2022-04-27T17:03:24.515633531Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=39.548423197s cached=15109 returned=3596 partial=2
level=info ts=2022-04-27T17:03:24.547571352Z caller=compact.go:947 msg="start of GC"
level=info ts=2022-04-27T17:03:24.681431916Z caller=compact.go:970 msg="start of compactions"
level=info ts=2022-04-27T17:03:24.681781405Z caller=compact.go:713 group="0@{idc_name=\"hailaer\", promCluster=\"k8s-hailaer\"}" groupKey=0@13427109783564785407 msg="compaction available and planned; downloading blocks" plan="[01G0VJVWYPZY6Q3RP1DRD8K7SH (min time: 1650182400000, max time: 1650189600000) 01G0VSQM4FCM87E2GKEGJ8B4QF (min time: 1650189600000, max time: 1650196800000) 01G0W0KBCFT9ETQ8VMESJJD5A1 (min time: 1650196800000, max time: 1650204000000) 01G0W7F2PM2GHEWB6937N6TQWH (min time: 1650204000000, max time: 1650211200000)]"
level=info ts=2022-04-27T17:03:24.681898535Z caller=compact.go:713 group="0@{idc_name=\"xiasha\", promCluster=\"k8s-xiasha\"}" groupKey=0@10665639288258739055 msg="compaction available and planned; downloading blocks" plan="[01G0Q9HBWEDKAEXEWDZ0D5SCMP (min time: 1650038400000, max time: 1650045600000) 01G0QGD34B1QFDF4TX7Y3CBAHE (min time: 1650045600000, max time: 1650052800000) 01G0QQ8TCH8D1AN2E4NBNRKXAJ (min time: 1650052800000, max time: 1650060000000) 01G0QY4HMFQX1247VRXXPY67SD (min time: 1650060000000, max time: 1650067200000)]"
level=info ts=2022-04-27T17:03:24.683094765Z caller=compact.go:713 group="0@{idc_name=\"txyun\", promCluster=\"faros-txyun-15\"}" groupKey=0@15087332170219289486 msg="compaction available and planned; downloading blocks" plan="[01G07V52DT2KV7JAGSAVMK1TBF (min time: 1649520000102, max time: 1649527200000) 01G0820SNSQQ2E74RMRGR56XVN (min time: 1649527200101, max time: 1649534400000) 01G088WGXSCGCV6YN2Q3YKBW8G (min time: 1649534400100, max time: 1649541600000) 01G08FR7QD76109QX60PMT0Y8X (min time: 1649541600100, max time: 1649548800000)]"
level=info ts=2022-04-27T17:03:29.286056081Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=15.18043799s cached=15109 returned=15109 partial=3
level=info ts=2022-04-27T17:03:37.231735253Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=12.673422926s cached=15109 returned=3596 partial=3
level=info ts=2022-04-27T17:03:37.27633699Z caller=clean.go:33 msg="started cleaning of aborted partial uploads"
level=info ts=2022-04-27T17:03:37.276374632Z caller=clean.go:60 msg="cleaning of aborted partial uploads done"
level=info ts=2022-04-27T17:03:37.276407673Z caller=blocks_cleaner.go:43 msg="started cleaning of blocks marked for deletion"
level=info ts=2022-04-27T17:03:37.276443791Z caller=blocks_cleaner.go:57 msg="cleaning of blocks marked for deletion done"
level=info ts=2022-04-27T17:04:26.644358139Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=12.53855851s cached=15112 returned=15112 partial=2
level=info ts=2022-04-27T17:05:26.567878028Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=12.461535131s cached=15113 returned=15113 partial=2
level=info ts=2022-04-27T17:06:26.478679199Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=12.372962696s cached=15115 returned=15115 partial=2
level=info ts=2022-04-27T17:07:26.313583141Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=12.207127851s cached=15115 returned=15115 partial=3
level=warn ts=2022-04-27T17:07:40.356165744Z caller=objstore.go:277 group="0@{idc_name=\"xiasha\", promCluster=\"k8s-xiasha\"}" groupKey=0@10665639288258739055 msg="failed to remove file on partial dir download error" file=/var/thanos/compactor/thanos-compactor-01/compact/0@10665639288258739055/01G0Q9HBWEDKAEXEWDZ0D5SCMP err="remove /var/thanos/compactor/thanos-compactor-01/compact/0@10665639288258739055/01G0Q9HBWEDKAEXEWDZ0D5SCMP: directory not empty"
level=warn ts=2022-04-27T17:07:53.58799131Z caller=objstore.go:277 group="0@{idc_name=\"hailaer\", promCluster=\"k8s-hailaer\"}" groupKey=0@13427109783564785407 msg="failed to remove file on partial dir download error" file=/var/thanos/compactor/thanos-compactor-01/compact/0@13427109783564785407/01G0VJVWYPZY6Q3RP1DRD8K7SH err="remove /var/thanos/compactor/thanos-compactor-01/compact/0@13427109783564785407/01G0VJVWYPZY6Q3RP1DRD8K7SH: directory not empty"
level=info ts=2022-04-27T17:08:05.777369526Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=20.809750901s cached=15115 returned=3596 partial=3
level=info ts=2022-04-27T17:08:05.842788558Z caller=clean.go:33 msg="started cleaning of aborted partial uploads"
level=info ts=2022-04-27T17:08:05.84284338Z caller=clean.go:60 msg="cleaning of aborted partial uploads done"
level=info ts=2022-04-27T17:08:05.842862019Z caller=blocks_cleaner.go:43 msg="started cleaning of blocks marked for deletion"
level=info ts=2022-04-27T17:08:05.842889092Z caller=blocks_cleaner.go:57 msg="cleaning of blocks marked for deletion done"
level=info ts=2022-04-27T17:08:25.999476861Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=11.892776889s cached=15116 returned=15116 partial=2
level=info ts=2022-04-27T17:09:26.304404957Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=12.197821912s cached=15117 returned=15117 partial=1
level=warn ts=2022-04-27T17:10:18.889768721Z caller=objstore.go:277 group="0@{idc_name=\"txyun\", promCluster=\"faros-txyun-15\"}" groupKey=0@15087332170219289486 msg="failed to remove file on partial dir download error" file=/var/thanos/compactor/thanos-compactor-01/compact/0@15087332170219289486/01G07V52DT2KV7JAGSAVMK1TBF err="remove /var/thanos/compactor/thanos-compactor-01/compact/0@15087332170219289486/01G07V52DT2KV7JAGSAVMK1TBF: directory not empty"
level=error ts=2022-04-27T17:10:18.889881745Z caller=compact.go:469 msg="retriable error" err="compaction: 3 errors: group 0@10665639288258739055: download block 01G0Q9HBWEDKAEXEWDZ0D5SCMP: copy object to file: unexpected EOF; group 0@13427109783564785407: download block 01G0VJVWYPZY6Q3RP1DRD8K7SH: copy object to file: unexpected EOF; group 0@15087332170219289486: download block 01G07V52DT2KV7JAGSAVMK1TBF: copy object to file: unexpected EOF"
level=info ts=2022-04-27T17:10:18.88994413Z caller=compact.go:942 msg="start sync of metas"
level=info ts=2022-04-27T17:10:25.982551935Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=11.87639421s cached=15118 returned=15118 partial=1
level=info ts=2022-04-27T17:10:33.708094103Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=14.818129403s cached=15118 returned=3596 partial=1
level=info ts=2022-04-27T17:10:33.750083461Z caller=compact.go:947 msg="start of GC"
level=info ts=2022-04-27T17:10:33.804980678Z caller=compact.go:970 msg="start of compactions"
level=info ts=2022-04-27T17:10:33.805214601Z caller=compact.go:713 group="0@{idc_name=\"hailaer\", promCluster=\"k8s-hailaer\"}" groupKey=0@13427109783564785407 msg="compaction available and planned; downloading blocks" plan="[01G0VJVWYPZY6Q3RP1DRD8K7SH (min time: 1650182400000, max time: 1650189600000) 01G0VSQM4FCM87E2GKEGJ8B4QF (min time: 1650189600000, max time: 1650196800000) 01G0W0KBCFT9ETQ8VMESJJD5A1 (min time: 1650196800000, max time: 1650204000000) 01G0W7F2PM2GHEWB6937N6TQWH (min time: 1650204000000, max time: 1650211200000)]"
level=info ts=2022-04-27T17:10:33.805660241Z caller=compact.go:713 group="0@{idc_name=\"xiasha\", promCluster=\"k8s-xiasha\"}" groupKey=0@10665639288258739055 msg="compaction available and planned; downloading blocks" plan="[01G0Q9HBWEDKAEXEWDZ0D5SCMP (min time: 1650038400000, max time: 1650045600000) 01G0QGD34B1QFDF4TX7Y3CBAHE (min time: 1650045600000, max time: 1650052800000) 01G0QQ8TCH8D1AN2E4NBNRKXAJ (min time: 1650052800000, max time: 1650060000000) 01G0QY4HMFQX1247VRXXPY67SD (min time: 1650060000000, max time: 1650067200000)]"
level=info ts=2022-04-27T17:10:33.81722757Z caller=compact.go:713 group="0@{idc_name=\"txyun\", promCluster=\"faros-txyun-15\"}" groupKey=0@15087332170219289486 msg="compaction available and planned; downloading blocks" plan="[01G07V52DT2KV7JAGSAVMK1TBF (min time: 1649520000102, max time: 1649527200000) 01G0820SNSQQ2E74RMRGR56XVN (min time: 1649527200101, max time: 1649534400000) 01G088WGXSCGCV6YN2Q3YKBW8G (min time: 1649534400100, max time: 1649541600000) 01G08FR7QD76109QX60PMT0Y8X (min time: 1649541600100, max time: 1649548800000)]"
level=info ts=2022-04-27T17:11:25.945123588Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=11.838800383s cached=15118 returned=15118 partial=2
level=info ts=2022-04-27T17:12:26.279221844Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=12.172663213s cached=15119 returned=15119 partial=1


frakev commented 2 years ago

Hello, have you tried using --wait-interval=5m? It sets the wait interval between consecutive compaction runs and bucket refreshes, and only takes effect when the --wait flag is specified.

ChaoHsupin commented 2 years ago

> Hello, have you tried using --wait-interval=5m? It sets the wait interval between consecutive compaction runs and bucket refreshes, and only takes effect when the --wait flag is specified.

Thanks for the reply! I tried adding --wait-interval=5m, but it didn't help. There are still a large number of repeated logs like the following:

level=info ts=2022-05-05T15:57:30.310923964Z caller=blocks_cleaner.go:58 msg="cleaning of blocks marked for deletion done"
level=info ts=2022-05-05T15:57:42.20936612Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=10.163160885s duration_ms=10163 cached=13960 returned=13960 partial=1
level=info ts=2022-05-05T15:58:42.229451273Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=10.183314864s duration_ms=10183 cached=13960 returned=13960 partial=1
level=info ts=2022-05-05T15:59:42.219694111Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=10.173728222s duration_ms=10173 cached=13960 returned=13960 partial=1
level=info ts=2022-05-05T16:00:42.447690898Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=10.40206302s duration_ms=10402 cached=13960 returned=13960 partial=1
level=info ts=2022-05-05T16:01:42.691945272Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=10.646627619s duration_ms=10646 cached=13960 returned=13960 partial=1
level=info ts=2022-05-05T16:02:17.940555585Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=13.620869519s duration_ms=13620 cached=13960 returned=1227 partial=1
level=info ts=2022-05-05T16:02:17.95942226Z caller=clean.go:34 msg="started cleaning of aborted partial uploads"
level=info ts=2022-05-05T16:02:17.95945469Z caller=clean.go:61 msg="cleaning of aborted partial uploads done"
level=info ts=2022-05-05T16:02:17.959464355Z caller=blocks_cleaner.go:44 msg="started cleaning of blocks marked for deletion"
level=info ts=2022-05-05T16:02:17.959559292Z caller=blocks_cleaner.go:58 msg="cleaning of blocks marked for deletion done"
level=info ts=2022-05-05T16:02:31.59749299Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=13.637959951s duration_ms=13637 cached=13960 returned=1227 partial=1
level=info ts=2022-05-05T16:02:42.722209006Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=10.676682571s duration_ms=10676 cached=13960 returned=13960 partial=1
level=info ts=2022-05-05T16:03:42.712906688Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=10.666756324s duration_ms=10666 cached=13960 returned=13960 partial=1
level=info ts=2022-05-05T16:04:42.570681982Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=10.524455654s duration_ms=10524 cached=13960 returned=13960 partial=1
level=info ts=2022-05-05T16:05:42.846779864Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=10.801424978s duration_ms=10801 cached=13960 returned=13960 partial=2
level=info ts=2022-05-05T16:07:03.456053746Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=31.410013001s duration_ms=31410 cached=13960 returned=13960 partial=2
level=info ts=2022-05-05T16:07:17.633996367Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=13.31411497s duration_ms=13314 cached=13960 returned=1227 partial=2
level=info ts=2022-05-05T16:07:30.984159971Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=13.330225285s duration_ms=13330 cached=13960 returned=1227 partial=2
yeya24 commented 2 years ago

@BegoniaGit Is the compactor currently in a halted state? You can check this in the logs or via the compactor metric thanos_compact_halted.

ChaoHsupin commented 2 years ago

@yeya24 Thank you very much! The compaction did indeed stop. There is no relevant error log now, which leaves me quite confused. Could you give me some suggestions for further investigation?

# HELP thanos_compact_halted Set to 1 if the compactor halted due to an unexpected error.
# TYPE thanos_compact_halted gauge
thanos_compact_halted 1
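
For catching this earlier in the future, one could alert on that metric. A minimal Prometheus alerting-rule sketch; the rule name, duration, and severity label are illustrative, not from this thread:

```yaml
groups:
  - name: thanos-compact
    rules:
      - alert: ThanosCompactHalted
        # Fires when the compactor has stopped due to an unexpected error.
        expr: thanos_compact_halted == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Thanos compactor has halted due to an unexpected error"
```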
yeya24 commented 2 years ago

> @yeya24 Thank you very much! The compaction did indeed stop. There is no relevant error log now, which leaves me quite confused. Could you give me some suggestions for further investigation?
>
> # HELP thanos_compact_halted Set to 1 if the compactor halted due to an unexpected error.
> # TYPE thanos_compact_halted gauge
> thanos_compact_halted 1

I think there must be some logs indicating the reason for the halt. Maybe you can restart the compactor; it will likely hit the halting error again, and then you can take a look at why.

ChaoHsupin commented 2 years ago

@yeya24 Thank you! After restarting, when thanos_compact_halted went to 1 again, I got some details:

level=error ts=2022-05-15T11:41:18.413393736Z caller=compact.go:495 msg="critical error detected; halting" err="compaction: group 0@10665639288258739055: compact blocks [/var/thanos/compactor/thanos-compactor-01/compact/0@10665639288258739055/01G292HXWFQY79P88N50WATF0M /var/thanos/compactor/thanos-compactor-01/compact/0@10665639288258739055/01G299DN46V8XFSC10744SRXMX /var/thanos/compactor/thanos-compactor-01/compact/0@10665639288258739055/01G29G9CC6JCSJ64PKEK363P2S /var/thanos/compactor/thanos-compactor-01/compact/0@10665639288258739055/01G29Q53N8T3Q4SHEXNN4NTMN9]: 2 errors: populate block: write chunks: preallocate: no space left on device; sync /var/thanos/compactor/thanos-compactor-01/compact/0@10665639288258739055/01G33QXXJXNF9ZZD5WCNK0XYHV.tmp-for-creation/chunks/000002: file already closed"

For now I have cleared the disk and reduced the compaction concurrency to 1, and I will keep observing.
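
Since the halt was caused by running out of local disk ("no space left on device"), a pre-flight check on the data dir can surface the problem before the compactor halts. A hedged Python sketch; the path and threshold are illustrative:

```python
import shutil

def free_gib(path: str) -> float:
    """Free space on the filesystem backing `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30

# Illustrative pre-flight check: compaction needs room for the downloaded
# source blocks plus the newly written block, so keep generous headroom.
DATA_DIR = "/tmp"      # stand-in for the compactor's --data-dir
MIN_FREE_GIB = 100     # illustrative threshold, size to your block sizes
if free_gib(DATA_DIR) < MIN_FREE_GIB:
    print(f"warning: under {MIN_FREE_GIB} GiB free on {DATA_DIR}")
```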

yeya24 commented 2 years ago

> @yeya24 Thank you! After restarting, when thanos_compact_halted went to 1 again, I got some details:
>
> level=error ts=2022-05-15T11:41:18.413393736Z caller=compact.go:495 msg="critical error detected; halting" err="compaction: group 0@10665639288258739055: compact blocks [/var/thanos/compactor/thanos-compactor-01/compact/0@10665639288258739055/01G292HXWFQY79P88N50WATF0M /var/thanos/compactor/thanos-compactor-01/compact/0@10665639288258739055/01G299DN46V8XFSC10744SRXMX /var/thanos/compactor/thanos-compactor-01/compact/0@10665639288258739055/01G29G9CC6JCSJ64PKEK363P2S /var/thanos/compactor/thanos-compactor-01/compact/0@10665639288258739055/01G29Q53N8T3Q4SHEXNN4NTMN9]: 2 errors: populate block: write chunks: preallocate: no space left on device; sync /var/thanos/compactor/thanos-compactor-01/compact/0@10665639288258739055/01G33QXXJXNF9ZZD5WCNK0XYHV.tmp-for-creation/chunks/000002: file already closed"
>
> For now I have cleared the disk and reduced the compaction concurrency to 1, and I will keep observing.

So it looks like you need more disk space.

yeya24 commented 2 years ago

Due to lack of response, I will close this one. Feel free to reopen it if you still see this issue.