thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

compact: once it runs out of disk space, it keeps running but is still broken #7197

Open calestyo opened 8 months ago

calestyo commented 8 months ago

Thanos, Prometheus and Golang version used:

# thanos --version
thanos, version 0.34.1 (branch: HEAD, revision: 4cf1559998bf6d8db3f9ca0fde2a00d217d4e23e)
  build user:       root@61db75277a55
  build date:       20240219-17:13:48
  go version:       go1.21.7
  platform:         linux/amd64
  tags:             netgo

Object Storage Provider: filesystem
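
(For completeness, a filesystem objstore configuration has roughly this shape; the schema is the standard Thanos objstore one, but the directory path here is only illustrative, and the actual settings live in the /etc/thanos/data-thanos.yml referenced further below.)

type: FILESYSTEM
config:
  # Local directory that acts as the "bucket" (path is illustrative).
  directory: "/data/main/thanos/bucket"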

What happened: I have run into this problem several times now: compact runs out of local disk space, e.g. with an error like:

Mar 04 05:48:59 foo thanos@compact[2138629]: ts=2024-03-04T04:48:59.919722752Z caller=compact.go:491 level=error msg="critical error detected; halting" err="compaction: group 0@2951439482451486615: compact blocks [/data/main/thanos/compact/cache/compact/0@2951439482451486615/01HNPSBZE5B3W20MPYBB1ANBC9 /data/main/thanos/compact/cache/compact/0@2951439482451486615/01HNVY3CMAQGJ5FJVXFJ0HF8MX /data/main/thanos/compact/cache/compact/0@2951439482451486615/01HP12S5NVRYB1HGRRZF3WGHYS /data/main/thanos/compact/cache/compact/0@2951439482451486615/01HP67KTYPRMZ7REX5D5P7HBRH /data/main/thanos/compact/cache/compact/0@2951439482451486615/01HPBCDE60PGQT0NVJMAZN4REY /data/main/thanos/compact/cache/compact/0@2951439482451486615/01HPGH0QTWZG37HR1TVA3YWKHQ /data/main/thanos/compact/cache/compact/0@2951439482451486615/01HPNNYRYMC8F7PVP67THGF7JA]: 2 errors: populate block: write chunks: preallocate: no space left on device; sync /data/main/thanos/compact/cache/compact/0@2951439482451486615/01HR3WCT1D1N7CH7Y763F13ZRQ.tmp-for-creation/chunks/000063: file already closed"

The problem is that once this has happened, compact seems to be stuck in a broken state: even if disk space is freed up again, it doesn't retry. It doesn't exit with a failure either, so a service manager like systemd has no real chance to notice the problem (and e.g. send a mail).
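
(A partial workaround for noticing this at all: compact exposes a thanos_compact_halted gauge that goes to 1 once it halts, so one can at least alert on it from Prometheus. Below is only a minimal rule sketch, with group name, duration and labels chosen arbitrarily.)

groups:
  - name: thanos-compact
    rules:
      - alert: ThanosCompactHalted
        # thanos_compact_halted is set to 1 once compact hits a critical error and halts.
        expr: thanos_compact_halted == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          description: Thanos Compact has halted after a critical error and needs manual intervention.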

What you expected to happen:

There should be an option for each component (including compact) that causes it to exit immediately with a non-zero exit status in case of an error.

Cheers, Chris.

douglascamata commented 8 months ago

Which options did you use to start this Compactor?

calestyo commented 8 months ago

Options are:

/usr/bin/thanos compact --log.level=warn --wait --data-dir=/data/main/thanos/compact/cache --objstore.config-file=/etc/thanos/data-thanos.yml --consistency-delay=1h --hash-func=SHA256 --http-address=:10912 --web.external-prefix=/thanos/compact --web.route-prefix=/

But I just found --no-debug.halt-on-error, which looks as if it may produce the right behaviour.
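
That is, roughly the invocation above with halting on critical errors disabled (untested in my setup so far):

/usr/bin/thanos compact --no-debug.halt-on-error --log.level=warn --wait --data-dir=/data/main/thanos/compact/cache --objstore.config-file=/etc/thanos/data-thanos.yml --consistency-delay=1h --hash-func=SHA256 --http-address=:10912 --web.external-prefix=/thanos/compact --web.route-prefix=/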

I wonder why that isn't the default.
The explanation given for why it isn't is honestly not very convincing.

Yes, it would be bad to retry over and over if it just keeps failing, but there are much better ways to handle this.
a) Not retrying endlessly is the service manager's (i.e. systemd's) responsibility anyway, and systemd offers plenty of proper ways to handle that (a minimal sketch follows below), and b) Thanos could e.g. set a flag on such an error condition that makes it fail immediately without retrying. That would still allow the important part, namely failing; otherwise people in normal environments won't notice that anything is wrong.
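
A minimal sketch of such a systemd-side policy, assuming the templated unit name thanos@compact.service seen in the journal above and a hypothetical status-email@.service helper for notifications:

# /etc/systemd/system/thanos@compact.service.d/restart-policy.conf
[Unit]
# Give up after 5 failed starts within one hour and leave the unit in the "failed" state.
StartLimitIntervalSec=1h
StartLimitBurst=5
# Hypothetical helper unit that e.g. sends a mail whenever this unit ends up failed.
OnFailure=status-email@%n.service

[Service]
Restart=on-failure
RestartSec=5min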

danksim commented 7 months ago

I am seeing a similar behavior where it errors once (no space left on device), halts and never compacts again.

level=error ts=2024-03-19T16:41:39.30753731Z caller=compact.go:486 msg="critical error detected; halting" err="compaction: group 0@16225028991790494220: compact blocks [/data/compact/0@16225028991790494220/01F3EZ3ZAA58NSNX2EPRS03N5H /data/compact/0@16225028991790494220/01F3M3W8CDMFQP1ZTS72F07EGQ /data/compact/0@16225028991790494220/01F3S8RRSXCY18WNCS6NS6T1N4 /data/compact/0@16225028991790494220/01F3YDPP5RRD4FFERPTVTRY86T /data/compact/0@16225028991790494220/01F43J9M8VME00PDVN2FQ84SFH /data/compact/0@16225028991790494220/01F48Q59D5Z77DG93SGEYVZ8B1 /data/compact/0@16225028991790494220/01F4DW1XNESEQH4RBW5AZMN6P4]: 3 errors: populate block: write chunks: preallocate: no space left on device; sync /data/compact/0@16225028991790494220/01HSBRPYTTJ57GJBME0FR28SFV.tmp-for-creation/chunks/000088: file already closed; write /data/compact/0@16225028991790494220/01HSBRPYTTJ57GJBME0FR28SFV.tmp-for-creation/index: no space left on device"
level=info ts=2024-03-19T16:42:04.648266954Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=2.059553102s duration_ms=2059 cached=40630 returned=40630 partial=0
level=info ts=2024-03-19T16:42:32.306773124Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=43.087680327s duration_ms=43087 cached=40630 returned=40610 partial=0
level=info ts=2024-03-19T16:43:15.869657891Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=43.552571092s duration_ms=43552 cached=40630 returned=40610 partial=0
level=info ts=2024-03-19T16:43:15.88204689Z caller=clean.go:34 msg="started cleaning of aborted partial uploads"
level=info ts=2024-03-19T16:43:15.88207772Z caller=clean.go:61 msg="cleaning of aborted partial uploads done"
level=info ts=2024-03-19T16:43:15.882085075Z caller=blocks_cleaner.go:44 msg="started cleaning of blocks marked for deletion"
level=info ts=2024-03-19T16:43:15.882099343Z caller=blocks_cleaner.go:58 msg="cleaning of blocks marked for deletion done"

I am running the compact component with a 100 GB persistent volume, and after the error the disk usage goes back to ~50% (screenshot of the disk-usage graph attached).

douglascamata commented 7 months ago

To be clear: you won't ever recover a Compactor that has run out of disk space without giving it more disk. I agree with the expected behavior suggested by @calestyo though: the Compactor should exit when it detects that it doesn't have enough disk space available.

To fix the Compactor: temporarily give it a (much) bigger PV, restart it, and let it work through the accumulated backlog.

Finally, when the Compactor catches up, you can follow a similar procedure to shrink the PV.
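
The expansion step on Kubernetes could look like the sketch below, assuming a PersistentVolumeClaim named data-thanos-compact-0 in a monitoring namespace and a storage class that allows volume expansion (names and target size are illustrative):

# Grow the Compactor's PVC, e.g. from 100Gi to 400Gi; the CSI driver then
# expands the underlying volume and filesystem (a pod restart may be needed).
kubectl -n monitoring patch pvc data-thanos-compact-0 --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"400Gi"}}}}'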

calestyo commented 7 months ago

you won't ever recover a Compactor that has run out of disk space without giving it more disk.

The problem, however, is that even with more disk (or with space freed up again) it never recovers and never does anything useful again.

I think the default behaviour of daemons is to fail in such a situation (so that the service manager, like systemd, can react)... a --keep-going mode might be nice for some people, but only if it then actually ever does something again (which currently seems not to be the case).

douglascamata commented 7 months ago

The problem, however, is that even with more disk (or with space freed up again) it never recovers and never does anything useful again.

You will need to give it more total disk space than it had before. The amount of extra space required is proportional to the time the Compactor had issues. I've seen setups where the Compactor had initially about 200 GB of disk and we had to give it 4x more.

As I said, I agree with you that the Compactor's default behavior is not ideal. We should improve it.

calestyo commented 7 months ago

You will need to give it more total disk space than it had before.

I don't see why that should have to be the case. The filesystem could have had some other big consumer of space, and with that gone there could be just enough room for compact, even though the total size stays the same.

The other issue you're referring to, that it needs more and more space, is what I've described in #7198, which is probably a duplicate of #3405.

IMO the proper behaviour would be as follows: fail by default with a non-zero exit status on any critical error, so that the service manager can react, and offer any keep-running mode only as an opt-in that actually retries once the error condition has gone away.

calestyo commented 4 months ago

@douglascamata Could you please remove the needs-more-info label? It no longer applies.

calestyo commented 4 months ago

Yet another case where compact runs into trouble, never fails, never recovers, and nobody will ever notice that it's actually no longer doing anything:

For some reason (don't ask me how it happens; perhaps another bug in Thanos?) overlapping blocks showed up:

Jul 04 05:59:11 lcg-lrz-monitor-ext thanos@compact[2524381]: ts=2024-07-04T03:59:11.880887578Z caller=main.go:161 level=error err="group 0@16428505770164304275: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s, blocks: 2]: <ulid: 01J0Z8WZBKK3G2VVVJN70WQN32, mint: 1718841600085, maxt: 1719014400000, range: 47h59m59s>, <ulid: 01J10W3EJT73RS0W5EPMFMKK82, mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s>\ncompaction\nmain.runCompact.func7\n\t/app/cmd/thanos/compact.go:432\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:517\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:516\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\ncritical error detected\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:531\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:516\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\ncompact command failed\nmain.main\n\t/app/cmd/thanos/main.go:161\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"
Jul 04 05:59:19 lcg-lrz-monitor-ext thanos@compact[2524442]: ts=2024-07-04T03:59:19.844193328Z caller=intrumentation.go:67 level=warn msg="changing probe status" status=not-ready reason="critical error detected: compaction: group 0@16428505770164304275: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s, blocks: 2]: <ulid: 01J0Z8WZBKK3G2VVVJN70WQN32, mint: 1718841600085, maxt: 1719014400000, range: 47h59m59s>, <ulid: 01J10W3EJT73RS0W5EPMFMKK82, mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s>"
Jul 04 05:59:19 lcg-lrz-monitor-ext thanos@compact[2524442]: ts=2024-07-04T03:59:19.844552963Z caller=main.go:161 level=error err="group 0@16428505770164304275: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s, blocks: 2]: <ulid: 01J0Z8WZBKK3G2VVVJN70WQN32, mint: 1718841600085, maxt: 1719014400000, range: 47h59m59s>, <ulid: 01J10W3EJT73RS0W5EPMFMKK82, mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s>\ncompaction\nmain.runCompact.func7\n\t/app/cmd/thanos/compact.go:432\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:517\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:516\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\ncritical error detected\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:531\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:516\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\ncompact command failed\nmain.main\n\t/app/cmd/thanos/main.go:161\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"
Jul 04 05:59:27 lcg-lrz-monitor-ext thanos@compact[2524512]: ts=2024-07-04T03:59:27.839397302Z caller=intrumentation.go:67 level=warn msg="changing probe status" status=not-ready reason="critical error detected: compaction: group 0@16428505770164304275: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s, blocks: 2]: <ulid: 01J0Z8WZBKK3G2VVVJN70WQN32, mint: 1718841600085, maxt: 1719014400000, range: 47h59m59s>, <ulid: 01J10W3EJT73RS0W5EPMFMKK82, mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s>"
Jul 04 05:59:27 lcg-lrz-monitor-ext thanos@compact[2524512]: ts=2024-07-04T03:59:27.840081544Z caller=main.go:161 level=error err="group 0@16428505770164304275: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s, blocks: 2]: <ulid: 01J0Z8WZBKK3G2VVVJN70WQN32, mint: 1718841600085, maxt: 1719014400000, range: 47h59m59s>, <ulid: 01J10W3EJT73RS0W5EPMFMKK82, mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s>\ncompaction\nmain.runCompact.func7\n\t/app/cmd/thanos/compact.go:432\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:517\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:516\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\ncritical error detected\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:531\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:516\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\ncompact command failed\nmain.main\n\t/app/cmd/thanos/main.go:161\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"
Jul 04 05:59:33 lcg-lrz-monitor-ext thanos@compact[2524570]: ts=2024-07-04T03:59:33.281127456Z caller=intrumentation.go:67 level=warn msg="changing probe status" status=not-ready reason="critical error detected: compaction: group 0@16428505770164304275: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s, blocks: 2]: <ulid: 01J0Z8WZBKK3G2VVVJN70WQN32, mint: 1718841600085, maxt: 1719014400000, range: 47h59m59s>, <ulid: 01J10W3EJT73RS0W5EPMFMKK82, mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s>"
Jul 04 05:59:33 lcg-lrz-monitor-ext thanos@compact[2524570]: ts=2024-07-04T03:59:33.285630641Z caller=compact.go:633 level=error msg="retriable error" err="BaseFetcher: iter bucket: context canceled"
Jul 04 05:59:33 lcg-lrz-monitor-ext thanos@compact[2524570]: ts=2024-07-04T03:59:33.286061605Z caller=main.go:161 level=error err="group 0@16428505770164304275: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s, blocks: 2]: <ulid: 01J0Z8WZBKK3G2VVVJN70WQN32, mint: 1718841600085, maxt: 1719014400000, range: 47h59m59s>, <ulid: 01J10W3EJT73RS0W5EPMFMKK82, mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s>\ncompaction\nmain.runCompact.func7\n\t/app/cmd/thanos/compact.go:432\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:517\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:516\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\ncritical error detected\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:531\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:516\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\ncompact command failed\nmain.main\n\t/app/cmd/thanos/main.go:161\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"
Jul 04 05:59:38 lcg-lrz-monitor-ext thanos@compact[2524630]: ts=2024-07-04T03:59:38.626967738Z caller=intrumentation.go:67 level=warn msg="changing probe status" status=not-ready reason="critical error detected: compaction: group 0@16428505770164304275: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s, blocks: 2]: <ulid: 01J0Z8WZBKK3G2VVVJN70WQN32, mint: 1718841600085, maxt: 1719014400000, range: 47h59m59s>, <ulid: 01J10W3EJT73RS0W5EPMFMKK82, mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s>"
Jul 04 05:59:38 lcg-lrz-monitor-ext thanos@compact[2524630]: ts=2024-07-04T03:59:38.630323903Z caller=compact.go:604 level=error msg="retriable error" err="syncing metas: BaseFetcher: iter bucket: context canceled"
Jul 04 05:59:38 lcg-lrz-monitor-ext thanos@compact[2524630]: ts=2024-07-04T03:59:38.630590996Z caller=main.go:161 level=error err="group 0@16428505770164304275: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s, blocks: 2]: <ulid: 01J0Z8WZBKK3G2VVVJN70WQN32, mint: 1718841600085, maxt: 1719014400000, range: 47h59m59s>, <ulid: 01J10W3EJT73RS0W5EPMFMKK82, mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s>\ncompaction\nmain.runCompact.func7\n\t/app/cmd/thanos/compact.go:432\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:517\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:516\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\ncritical error detected\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:531\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:516\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\ncompact command failed\nmain.main\n\t/app/cmd/thanos/main.go:161\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"
Jul 04 05:59:45 lcg-lrz-monitor-ext thanos@compact[2524687]: ts=2024-07-04T03:59:45.697592546Z caller=intrumentation.go:67 level=warn msg="changing probe status" status=not-ready reason="critical error detected: compaction: group 0@16428505770164304275: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s, blocks: 2]: <ulid: 01J0Z8WZBKK3G2VVVJN70WQN32, mint: 1718841600085, maxt: 1719014400000, range: 47h59m59s>, <ulid: 01J10W3EJT73RS0W5EPMFMKK82, mint: 1718985602708, maxt: 1718992800000, range: 1h59m57s>"

Despite compact being invoked with --no-debug.halt-on-error, it just mindlessly retries over and over. Why, I don't know; after all, how should the overlapping blocks ever disappear if nobody notices them (other than by chance), given that the service doesn't fail properly?
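
(For what it's worth, such overlaps can at least be surfaced explicitly with the bucket verify tool; the flag and issue names below are from memory for roughly this version, so double-check them against thanos tools bucket verify --help.)

# Read-only check that lists blocks with overlapping time ranges in the bucket.
thanos tools bucket verify \
  --objstore.config-file=/etc/thanos/data-thanos.yml \
  --issues=overlapped_blocks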

douglascamata commented 4 months ago

@calestyo do you have vertical compaction enabled and run version 0.35 or above? There is literally a PR that intentionally made the Compactor not halt on overlapping-blocks errors: https://github.com/thanos-io/thanos/pull/7225.
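
(For reference, and with the caveat that the exact flag names should be checked against the running version: vertical compaction is opt-in on the compactor, roughly as sketched below. The replica label name is an assumption, and this is not part of the invocation shown earlier in the thread.)

# Illustration only: what "vertical compaction enabled" means for the compactor.
/usr/bin/thanos compact --wait \
  --data-dir=/data/main/thanos/compact/cache \
  --objstore.config-file=/etc/thanos/data-thanos.yml \
  --compact.enable-vertical-compaction \
  --deduplication.replica-label=replica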

douglascamata commented 4 months ago

We are not aware of any bug in Thanos that creates overlapping blocks. I think this is more about how you use the tool than a defect in the tool itself. I have never seen Thanos create overlapping blocks in the deployments I oversee.