thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Subject: Thanos Compactor Fails to Delete Downsampling Data, Resulting in Disk Space Overfill #7090

Open anilreddyb opened 8 months ago

anilreddyb commented 8 months ago

RECWZB25NHY7AT err="remove /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT: directory not empty"
level=error ts=2024-01-24T13:15:57.569235075Z caller=compact.go:499 msg="retriable error" err="compaction: group 0@1151584605916149957: download block 01HFT1FJBC0FRECWZB25NHY7AT: copy object to file: write /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT/chunks/000012: no space left on device"
level=warn ts=2024-01-24T13:38:07.872259646Z caller=objstore.go:386 group="0@{cluster=\"\", env=\"uat\", prometheus=\"observability/kube-prometheus-stack-prometheus\", prometheus_replica=\"prometheus-kube-prometheus-stack-prometheus-2\"}" groupKey=0@1151584605916149957 msg="failed to remove file on partial dir download error" file=/data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT err="remove /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT: directory not empty"
level=error ts=2024-01-24T13:38:07.872381278Z caller=compact.go:499 msg="retriable error" err="compaction: group 0@1151584605916149957: download block 01HFT1FJBC0FRECWZB25NHY7AT: copy object to file: write /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT/chunks/000012: no space left on device"
level=warn ts=2024-01-24T14:00:17.88467112Z caller=objstore.go:386 group="0@{cluster=\"dev-test\", env=\"uat\", prometheus=\"observability/kube-prometheus-stack-prometheus\", prometheus_replica=\"prometheus-kube-prometheus-stack-prometheus-2\"}" groupKey=0@1151584605916149957 msg="failed to remove file on partial dir download error" file=/data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT err="remove /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT: directory not empty"
level=error ts=2024-01-24T14:00:17.884796883Z caller=compact.go:499 msg="retriable error" err="compaction: group 0@1151584605916149957: download block 01HFT1FJBC0FRECWZB25NHY7AT: copy object to file: write /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT/chunks/000012: no space left on device"
level=warn ts=2024-01-24T14:22:27.497118376Z caller=objstore.go:386 group="0@{cluster=\"dev-test\", env=\"uat\", prometheus=\"observability/kube-prometheus-stack-prometheus\", prometheus_replica=\"prometheus-kube-prometheus-stack-prometheus-2\"}" groupKey=0@1151584605916149957 msg="failed to remove file on partial dir download error" file=/data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT err="remove /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT: directory not empty"
level=error ts=2024-01-24T14:22:27.497247169Z caller=compact.go:499 msg="retriable error" err="compaction: group 0@1151584605916149957: download block 01HFT1FJBC0FRECWZB25NHY7AT: copy object to file: write /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT/chunks/000012: no space left on device"
level=warn ts=2024-01-24T14:44:37.052358967Z caller=objstore.go:386 group="0@{cluster=\"dev-test\", env=\"uat\", prometheus=\"observability/kube-prometheus-stack-prometheus\", prometheus_replica=\"prometheus-kube-prometheus-stack-prometheus-2\"}" groupKey=0@1151584605916149957 msg="failed to remove file on partial dir download error" file=/data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT err="remove /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT: directory not empty"
level=error ts=2024-01-24T14:44:37.05247423Z caller=compact.go:499 msg="retriable error" err="compaction: group 0@1151584605916149957: download block 01HFT1FJBC0FRECWZB25NHY7AT: copy object to file: write /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT/chunks/000012: no space left on device"
level=warn ts=2024-01-24T15:06:51.78477219Z caller=objstore.go:386 group="0@{cluster=\"dev-test\", env=\"uat\", prometheus=\"observability/kube-prometheus-stack-prometheus\", prometheus_replica=\"prometheus-kube-prometheus-stack-prometheus-2\"}" groupKey=0@1151584605916149957 msg="failed to remove file on partial dir download error" file=/data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT err="remove /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT: directory not empty"
level=error ts=2024-01-24T15:06:51.784890942Z caller=compact.go:499 msg="retriable error" err="compaction: group 0@1151584605916149957: download block 01HFT1FJBC0FRECWZB25NHY7AT: copy object to file: write /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT/chunks/000012: no space left on device"
level=warn ts=2024-01-24T15:29:06.323220227Z caller=objstore.go:386 group="0@{cluster=\"dev-test\", env=\"uat\", prometheus=\"observability/kube-prometheus-stack-prometheus\", prometheus_replica=\"prometheus-kube-prometheus-stack-prometheus-2\"}" groupKey=0@1151584605916149957 msg="failed to remove file on partial dir download error" file=/data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT err="remove /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT: directory not empty"
level=error ts=2024-01-24T15:29:06.32335083Z caller=compact.go:499 msg="retriable error" err="compaction: group 0@1151584605916149957: download block 01HFT1FJBC0FRECWZB25NHY7AT: copy object to file: write /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT/chunks/000012: no space left on device"
level=warn ts=2024-01-24T15:51:20.028189832Z caller=objstore.go:386 group="0@{cluster=\"dev-test\", env=\"uat\", prometheus=\"observability/kube-prometheus-stack-prometheus\", prometheus_replica=\"prometheus-kube-prometheus-stack-prometheus-2\"}" groupKey=0@1151584605916149957 msg="failed to remove file on partial dir download error" file=/data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT err="remove /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT: directory not empty"
level=error ts=2024-01-24T15:51:20.028325595Z caller=compact.go:499 msg="retriable error" err="compaction: group 0@1151584605916149957: download block 01HFT1FJBC0FRECWZB25NHY7AT: copy object to file: write /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT/chunks/000012: no space left on device"

anilreddyb commented 8 months ago

Thanos Compactor config:

[screenshot: Thanos Compactor configuration]

yeya24 commented 8 months ago

The error is:

level=error ts=2024-01-24T13:15:57.569235075Z caller=compact.go:499 msg="retriable error" err="compaction: group 0@1151584605916149957: download block 01HFT1FJBC0FRECWZB25NHY7AT: copy object to file: write /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT/chunks/000012: no space left on device"

Please allocate more space for the compactor pod.
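
If the Compactor is deployed via the Bitnami Thanos Helm chart (which the retentionResolution* values later in this thread suggest), the volume size is a chart value; a minimal sketch, with release name, namespace, and size as placeholders:

# Resize the Compactor's persistent volume (in-place expansion only works if
# the StorageClass has allowVolumeExpansion enabled).
helm upgrade thanos bitnami/thanos \
  --namespace observability \
  --reuse-values \
  --set compactor.persistence.size=100Gi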

anilreddyb commented 8 months ago

We are observing that only 6 GB of disk space is left on the disk.

[screenshot: disk usage on the Compactor volume]

How much disk space do we need to maintain for the standard process?
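
A quick way to see what is actually consuming the Compactor volume (the /data mount path is taken from the logs above; the commands assume shell access inside the pod):

# Overall usage of the volume, plus a per-compaction-group breakdown of the scratch dirs.
df -h /data
du -sh /data/compact/* 2>/dev/null | sort -h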

anilreddyb commented 8 months ago

The minTime and maxTime inside meta.json are from November 2023. All the latest data is being processed and is also available in the S3 bucket, but for some reason the Compactor tries to download the old (November 2023) data, and we are not sure why.
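
One way to see which blocks (and which time ranges and resolutions) the Compactor is looking at is to list the bucket directly; a sketch, assuming the same objstore.yml the Compactor uses:

# Lists every block with its time range, resolution and labels. Old blocks
# (e.g. from November 2023) show up here because compaction and downsampling
# operate on historical blocks, which have to be downloaded locally first.
thanos tools bucket ls -o wide --objstore.config-file=/conf/objstore.yml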

douglascamata commented 8 months ago

How much disk space do we need to maintain for the standard process?

This is impossible to predict and also depends on your configuration.

In cases like this, the general guidance is to simply give it more disk. Data deletion is the very last step in the Compactor's algorithm.

anilreddyb commented 8 months ago

@douglascamata, thanks for the response. A follow-up question: when you say unlimited retention, what exactly do you mean? Below is our current configuration; does it seem fine?

[screenshot: Thanos Compactor retention configuration]

retentionResolutionRaw: 30d
retentionResolution5m: 30d
retentionResolution1h: 10y
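
For reference, those chart values map onto the Compactor's retention flags; a rough sketch of the resulting invocation (the data dir and objstore paths are taken from the logs above, the rest is assumed):

thanos compact \
  --wait \
  --data-dir=/data \
  --objstore.config-file=/conf/objstore.yml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=30d \
  --retention.resolution-1h=10y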

We are also seeing the below error in the Compactor logs. What exactly does it signify, and could it be due to a wrong retention setting in the configuration?

level=warn ts=2024-01-24T13:38:07.872259646Z caller=objstore.go:386 group="0@{cluster="", env="uat", prometheus="observability/kube-prometheus-stack-prometheus", prometheus_replica="prometheus-kube-prometheus-stack-prometheus-2"}" groupKey=0@1151584605916149957 msg="failed to remove file on partial dir download error" file=/data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT err="remove /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT: directory not empty"
level=error ts=2024-01-24T13:38:07.872381278Z caller=compact.go:499 msg="retriable error" err="compaction: group 0@1151584605916149957: download block 01HFT1FJBC0FRECWZB25NHY7AT: copy object to file: write /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT/chunks/000012: no space left on device"
level=warn ts=2024-01-24T14:00:17.88467112Z caller=objstore.go:386 group="0@{cluster="dev-test", env="uat", prometheus="observability/kube-prometheus-stack-prometheus", prometheus_replica="prometheus-kube-prometheus-stack-prometheus-2"}" groupKey=0@1151584605916149957 msg="failed to remove file on partial dir download error" file=/data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT err="remove /data/compact/0@1151584605916149957/01HFT1FJBC0FRECWZB25NHY7AT: directory not empty"

douglascamata commented 8 months ago

@anilreddyb with 10 years of retention on 1h-downsampled metrics you will have problems in your system. Keep in mind that the Compactor has to be "aware" of literally all the blocks you have in your object storage. The Compactor (and Store Gateway) are often listing all blocks, reading their meta files, checking for markers (other metadata) stored as files, etc. Now imagine the number of requests that having 10 years of blocks there will generate. Factor in that some providers charge based on the number of API requests...

Otherwise, focusing on your logs and the fact that you have no disk space: that might be the reason for the other failures. I recommend scaling your Compactor to 0 replicas, cleaning up its PVC, and restarting it.
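
A minimal sketch of that reset, assuming the Compactor runs in the observability namespace with a PVC-backed data directory (the workload and PVC names here are placeholders; adjust them to your deployment):

# Stop the Compactor so nothing writes to the volume.
kubectl -n observability scale deployment thanos-compactor --replicas=0
# Wipe the local scratch data; everything under the Compactor's data dir can be
# re-downloaded from object storage, so losing it is safe. Deleting the PVC is one
# option; mounting it in a temporary pod and removing /data/compact/* is another.
kubectl -n observability delete pvc thanos-compactor-data
# Bring the Compactor back once the volume has been re-provisioned.
kubectl -n observability scale deployment thanos-compactor --replicas=1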

anilreddyb commented 8 months ago

I executed the commands below, deleting the previous data. After some time, a new folder was generated, and the data within it was associated with the date range of October 12th to 16th. My inquiry is why the data specifically references the month of October. It's important to note that this data is already present in the S3 bucket. Since the deletion of the old data, the /data directory is now at 100% free, eliminating any disk space concerns. Therefore, it seems the issue is unrelated to disk space.

thanos tools bucket retention --objstore.config-file=/conf/objstore.yml
thanos tools bucket cleanup --delete-delay=0s --objstore.config-file=/conf/objstore.yml

[screenshot: newly generated block folder under /data/compact]

Is there a way to investigate why the data consistently points to the month of October, even after manually deleting the old data multiple times?

douglascamata commented 8 months ago

why the data consistently points to the month of October

I don't understand the question. What do you mean by "data consistently points"?

anilreddyb commented 8 months ago

@douglascamata, I removed the old data within the /data/compact/ folder. Subsequently, new folders were generated. When I examine the meta.json file's minTime and maxTime, it displays a timestamp from the month of October. My question is: why does it indicate October when the file was created today? The timestamp should reflect the current date and month if the file was generated today. If there is a retention period of the last 30 days, the meta.json file should ideally show a timestamp within this timeframe. If it indicates October instead, there may be a configuration issue with the retention policy or data management process that needs to be addressed.

retentionResolutionRaw: 30d
retentionResolution5m: 30d
retentionResolution1h: 10y
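
Worth noting here: minTime and maxTime in a block's meta.json are Unix timestamps in milliseconds describing the time range of the samples inside the block, not the time the block (or its local copy) was created. A quick way to inspect and decode them (the block path and the example value are hypothetical):

# Print the time range of a downloaded block.
jq '{minTime, maxTime}' /data/compact/<group>/<block-ulid>/meta.json
# Convert a millisecond timestamp to a human-readable UTC date (GNU date).
date -u -d @$((1699142400000 / 1000))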

douglascamata commented 8 months ago

There should be one meta.json file per block you have in object storage, never only one (unless you only have one block, of course). And you are still keeping 10y of 1h-resolution data.

There's no issue that we are aware of with retention policy or data management.

douglascamata commented 8 months ago

You need to let the Compactor run and monitor its metrics to see whether it's working. Looking at the filesystem without a deep understanding of how the Compactor works will only confuse you.
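
The relevant metrics are exposed on the Compactor's HTTP port (10902 by default); a sketch of pulling them, with hypothetical workload and namespace names:

# Port-forward to the Compactor and check whether it has halted and how compactions are progressing.
kubectl -n observability port-forward deploy/thanos-compactor 10902:10902 &
curl -s http://localhost:10902/metrics \
  | grep -E 'thanos_compact_halted|thanos_compact_group_compactions|thanos_compact_iterations_total'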