thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Compactor: error executing compaction: invalid size #6380

Open oleksiytsyban opened 1 year ago

oleksiytsyban commented 1 year ago

Thanos, Prometheus and Golang version used: Deployed via Helm.
docker.io/bitnami/thanos:0.31.0-scratch-r1
quay.io/prometheus/prometheus:v2.43.0-stringlabels

Object Storage Provider: S3

What happened: The compactor crashes while attempting to compact a recently created block. The crashes began soon after we enabled vertical compaction. The block details:

01H0R4SZD0FRV0WYF8DJT2FWHC block:

Start Time: April 28, 2023 6:00 PM
End Time: May 10, 2023 6:00 PM
Duration: 12 days
Series: 10637711
Samples: 3787156705
Chunks: 50901130
Resolution: 300000
Level: 6
Source: compactor
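
As a sanity check on the block metadata above, the following Python sketch decodes the creation time embedded in the block ULID and converts the mint/maxt values from the compaction log below to UTC. It is not part of Thanos, and the helper names `ulid_timestamp_ms` and `from_ms` are made up for illustration. It assumes the standard ULID layout, in which the first 10 Crockford base32 characters hold a 48-bit Unix timestamp in milliseconds.

```python
from datetime import datetime, timedelta, timezone

# Crockford base32 alphabet used by ULIDs (no I, L, O, U).
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ulid_timestamp_ms(ulid: str) -> int:
    """Return the millisecond timestamp stored in the first 10 ULID characters."""
    if len(ulid) != 26:
        raise ValueError("a ULID has exactly 26 characters")
    ms = 0
    for ch in ulid[:10].upper():
        ms = ms * 32 + CROCKFORD.index(ch)  # raises ValueError on invalid characters
    return ms


def from_ms(ms: int) -> datetime:
    """Convert Unix milliseconds to an aware UTC datetime using exact integer arithmetic."""
    return EPOCH + timedelta(milliseconds=ms)


if __name__ == "__main__":
    block = "01H0R4SZD0FRV0WYF8DJT2FWHC"  # block ID from this report
    mint, maxt = 1682726400001, 1683763200000  # from the "compact blocks" log line

    print("block created:", from_ms(ulid_timestamp_ms(block)).isoformat())
    print("mint:", from_ms(mint).isoformat())
    print("maxt:", from_ms(maxt).isoformat())
    print("span:", from_ms(maxt) - from_ms(mint))
```

Under these assumptions the block covers 2023-04-29 00:00 UTC to 2023-05-11 00:00 UTC, which agrees with the 12-day duration listed above if those times are displayed in a UTC-6 local zone. The ULID timestamp falls roughly six minutes before the "compact blocks" log line, in line with the compaction duration reported there.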

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:

level=info ts=2023-05-18T19:25:13.725920093Z caller=compact.go:460 msg="compact blocks" count=6 mint=1682726400001 maxt=1683763200000 ulid=01H0R4SZD0FRV0WYF8DJT2FWHC sources="[01H0PSNM9MRHAV20CKQFK6MSPZ 01H0PWZW4K5X4YWQJJ8MJRSP41 01H0PXN7H0J09CVQC9J793CSAW 01H0PYC0QQ4A7M4PWCXW3CW7X3 01H0PZ1790GD3AGQ4HGXRB639B 01H0PZPS6N1FPDZW086CPF7CEY]" duration=6m22.770877172s
level=info ts=2023-05-18T19:25:13.874672888Z caller=compact.go:1097 group="300000@{cluster_location=\"dca06\", cluster_name=\"cops-eks01-dca06\", environment=\"ai-prod\", group_name=\"dca06-c01-acc00\", object_name=\"dca06-c01-acc00\", prometheus=\"monitoring/cprc-kube-prometheus-stack-prometheus\", rc_layer=\"c01\", source_id=\"ai-dca06\"}" groupKey=300000@10792396202523153392 msg="compacted blocks" new=01H0R4SZD0FRV0WYF8DJT2FWHC blocks="[/data/compact/300000@10792396202523153392/01H0PSNM9MRHAV20CKQFK6MSPZ /data/compact/300000@10792396202523153392/01H0PWZW4K5X4YWQJJ8MJRSP41 /data/compact/300000@10792396202523153392/01H0PXN7H0J09CVQC9J793CSAW /data/compact/300000@10792396202523153392/01H0PYC0QQ4A7M4PWCXW3CW7X3 /data/compact/300000@10792396202523153392/01H0PZ1790GD3AGQ4HGXRB639B /data/compact/300000@10792396202523153392/01H0PZPS6N1FPDZW086CPF7CEY]" duration=6m22.919629574s duration_ms=382919 overlapping_blocks=false

level=info ts=2023-05-18T19:28:16.8484575Z caller=compact.go:1169 group="300000@{cluster_location=\"dca06\", cluster_name=\"cops-eks01-dca06\", environment=\"ai-prod\", group_name=\"dca06-c01-acc00\", object_name=\"dca06-c01-acc00\", prometheus=\"monitoring/cprc-kube-prometheus-stack-prometheus\", rc_layer=\"c01\", source_id=\"ai-dca06\"}" groupKey=300000@10792396202523153392 msg="marking compacted block for deletion" old_block=01H0PWZW4K5X4YWQJJ8MJRSP41
level=warn ts=2023-05-18T19:28:16.865935134Z caller=block.go:185 group="300000@{cluster_location=\"dca06\", cluster_name=\"cops-eks01-dca06\", environment=\"ai-prod\", group_name=\"dca06-c01-acc00\", object_name=\"dca06-c01-acc00\", prometheus=\"monitoring/cprc-kube-prometheus-stack-prometheus\", rc_layer=\"c01\", source_id=\"ai-dca06\"}" groupKey=300000@10792396202523153392 msg="requested to mark for deletion, but file already exists; this should not happen; investigate" err="file 01H0PWZW4K5X4YWQJJ8MJRSP41/deletion-mark.json already exists in bucket"
level=info ts=2023-05-18T19:28:16.904473356Z caller=compact.go:1169 group="300000@{cluster_location=\"dca06\", cluster_name=\"cops-eks01-dca06\", environment=\"ai-prod\", group_name=\"dca06-c01-acc00\", object_name=\"dca06-c01-acc00\", prometheus=\"monitoring/cprc-kube-prometheus-stack-prometheus\", rc_layer=\"c01\", source_id=\"ai-dca06\"}" groupKey=300000@10792396202523153392 msg="marking compacted block for deletion" old_block=01H0PXN7H0J09CVQC9J793CSAW
level=warn ts=2023-05-18T19:28:16.91392193Z caller=block.go:185 group="300000@{cluster_location=\"dca06\", cluster_name=\"cops-eks01-dca06\", environment=\"ai-prod\", group_name=\"dca06-c01-acc00\", object_name=\"dca06-c01-acc00\", prometheus=\"monitoring/cprc-kube-prometheus-stack-prometheus\", rc_layer=\"c01\", source_id=\"ai-dca06\"}" groupKey=300000@10792396202523153392 msg="requested to mark for deletion, but file already exists; this should not happen; investigate" err="file 01H0PXN7H0J09CVQC9J793CSAW/deletion-mark.json already exists in bucket"
level=info ts=2023-05-18T19:28:16.951007978Z caller=compact.go:1169 group="300000@{cluster_location=\"dca06\", cluster_name=\"cops-eks01-dca06\", environment=\"ai-prod\", group_name=\"dca06-c01-acc00\", object_name=\"dca06-c01-acc00\", prometheus=\"monitoring/cprc-kube-prometheus-stack-prometheus\", rc_layer=\"c01\", source_id=\"ai-dca06\"}" groupKey=300000@10792396202523153392 msg="marking compacted block for deletion" old_block=01H0PYC0QQ4A7M4PWCXW3CW7X3
level=warn ts=2023-05-18T19:28:16.961776405Z caller=block.go:185 group="300000@{cluster_location=\"dca06\", cluster_name=\"cops-eks01-dca06\", environment=\"ai-prod\", group_name=\"dca06-c01-acc00\", object_name=\"dca06-c01-acc00\", prometheus=\"monitoring/cprc-kube-prometheus-stack-prometheus\", rc_layer=\"c01\", source_id=\"ai-dca06\"}" groupKey=300000@10792396202523153392 msg="requested to mark for deletion, but file already exists; this should not happen; investigate" err="file 01H0PYC0QQ4A7M4PWCXW3CW7X3/deletion-mark.json already exists in bucket"
level=info ts=2023-05-18T19:28:16.998395388Z caller=compact.go:1169 group="300000@{cluster_location=\"dca06\", cluster_name=\"cops-eks01-dca06\", environment=\"ai-prod\", group_name=\"dca06-c01-acc00\", object_name=\"dca06-c01-acc00\", prometheus=\"monitoring/cprc-kube-prometheus-stack-prometheus\", rc_layer=\"c01\", source_id=\"ai-dca06\"}" groupKey=300000@10792396202523153392 msg="marking compacted block for deletion" old_block=01H0PZ1790GD3AGQ4HGXRB639B
level=warn ts=2023-05-18T19:28:17.009230178Z caller=block.go:185 group="300000@{cluster_location=\"dca06\", cluster_name=\"cops-eks01-dca06\", environment=\"ai-prod\", group_name=\"dca06-c01-acc00\", object_name=\"dca06-c01-acc00\", prometheus=\"monitoring/cprc-kube-prometheus-stack-prometheus\", rc_layer=\"c01\", source_id=\"ai-dca06\"}" groupKey=300000@10792396202523153392 msg="requested to mark for deletion, but file already exists; this should not happen; investigate" err="file 01H0PZ1790GD3AGQ4HGXRB639B/deletion-mark.json already exists in bucket"
level=info ts=2023-05-18T19:28:17.060040144Z caller=compact.go:1169 group="300000@{cluster_location=\"dca06\", cluster_name=\"cops-eks01-dca06\", environment=\"ai-prod\", group_name=\"dca06-c01-acc00\", object_name=\"dca06-c01-acc00\", prometheus=\"monitoring/cprc-kube-prometheus-stack-prometheus\", rc_layer=\"c01\", source_id=\"ai-dca06\"}" groupKey=300000@10792396202523153392 msg="marking compacted block for deletion" old_block=01H0PZPS6N1FPDZW086CPF7CEY
level=warn ts=2023-05-18T19:28:17.07085462Z caller=block.go:185 group="300000@{cluster_location=\"dca06\", cluster_name=\"cops-eks01-dca06\", environment=\"ai-prod\", group_name=\"dca06-c01-acc00\", object_name=\"dca06-c01-acc00\", prometheus=\"monitoring/cprc-kube-prometheus-stack-prometheus\", rc_layer=\"c01\", source_id=\"ai-dca06\"}" groupKey=300000@10792396202523153392 msg="requested to mark for deletion, but file already exists; this should not happen; investigate" err="file 01H0PZPS6N1FPDZW086CPF7CEY/deletion-mark.json already exists in bucket"
level=info ts=2023-05-18T19:28:17.070882193Z caller=compact.go:1156 group="300000@{cluster_location=\"dca06\", cluster_name=\"cops-eks01-dca06\", environment=\"ai-prod\", group_name=\"dca06-c01-acc00\", object_name=\"dca06-c01-acc00\", prometheus=\"monitoring/cprc-kube-prometheus-stack-prometheus\", rc_layer=\"c01\", source_id=\"ai-dca06\"}" groupKey=300000@10792396202523153392 msg="finished compacting blocks" result_block=01H0R4SZD0FRV0WYF8DJT2FWHC source_blocks="[/data/compact/300000@10792396202523153392/01H0PSNM9MRHAV20CKQFK6MSPZ /data/compact/300000@10792396202523153392/01H0PWZW4K5X4YWQJJ8MJRSP41 /data/compact/300000@10792396202523153392/01H0PXN7H0J09CVQC9J793CSAW /data/compact/300000@10792396202523153392/01H0PYC0QQ4A7M4PWCXW3CW7X3 /data/compact/300000@10792396202523153392/01H0PZ1790GD3AGQ4HGXRB639B /data/compact/300000@10792396202523153392/01H0PZPS6N1FPDZW086CPF7CEY]" duration=16m2.001473681s duration_ms=962001

level=info ts=2023-05-18T19:33:12.148405863Z caller=downsample.go:356 msg="downloaded block" id=01H0R4SZD0FRV0WYF8DJT2FWHC duration=4m50.253160486s duration_ms=290253
level=info ts=2023-05-18T19:33:16.084442318Z caller=fetcher.go:478 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.888808102s duration_ms=1888 cached=1871 returned=1871 partial=0
level=info ts=2023-05-18T19:33:58.249950384Z caller=streamed_block_writer.go:178 msg="finalized downsampled block" mint=1682726400001 maxt=1683763200000 ulid=01H0R5NM2NVS95B1J5YCT0Q0ER resolution=3600000
level=warn ts=2023-05-18T19:33:58.257884095Z caller=intrumentation.go:67 msg="changing probe status" status=not-ready reason="error executing compaction: first pass of downsampling failed: downsampling to 60 min: downsample block 01H0R4SZD0FRV0WYF8DJT2FWHC to window 3600000: downsample aggregate block, series: 1536060: invalid size"
level=info ts=2023-05-18T19:33:58.257924645Z caller=http.go:91 service=http/server component=compact msg="internal server is shutting down" err="error executing compaction: first pass of downsampling failed: downsampling to 60 min: downsample block 01H0R4SZD0FRV0WYF8DJT2FWHC to window 3600000: downsample aggregate block, series: 1536060: invalid size"
level=info ts=2023-05-18T19:33:58.258041902Z caller=http.go:110 service=http/server component=compact msg="internal server is shutdown gracefully" err="error executing compaction: first pass of downsampling failed: downsampling to 60 min: downsample block 01H0R4SZD0FRV0WYF8DJT2FWHC to window 3600000: downsample aggregate block, series: 1536060: invalid size"
level=info ts=2023-05-18T19:33:58.258060949Z caller=intrumentation.go:81 msg="changing probe status" status=not-healthy reason="error executing compaction: first pass of downsampling failed: downsampling to 60 min: downsample block 01H0R4SZD0FRV0WYF8DJT2FWHC to window 3600000: downsample aggregate block, series: 1536060: invalid size"
level=error ts=2023-05-18T19:33:58.258171883Z caller=main.go:161 err="downsampling to 60 min: downsample block 01H0R4SZD0FRV0WYF8DJT2FWHC to window 3600000: downsample aggregate block, series: 1536060: invalid size\nfirst pass of downsampling failed\nmain.runCompact.func7\n\t/app/cmd/thanos/compact.go:441\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:477\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:476\ngithub.com/oklog/run.(Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\nerror executing compaction\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:504\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:476\ngithub.com/oklog/run.(Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\ncompact command failed\nmain.main\n\t/app/cmd/thanos/main.go:161\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:159
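
To triage a failure like the one logged above, it can help to pull the block ID, the downsampling window, and the series number out of the error chain. Below is a minimal Python sketch, not Thanos code. The regular expression is derived only from the message text in this log, and the names `DownsampleFailure` and `parse_downsample_failure` are hypothetical. Other Thanos versions may word the message differently, and the number after `series:` is assumed here to identify a series in the block rather than a count.

```python
import re
from typing import NamedTuple, Optional


class DownsampleFailure(NamedTuple):
    block: str      # ULID of the block being downsampled
    window_ms: int  # target downsampling window in milliseconds
    series: int     # number reported after "series:" (assumed to identify a series)
    cause: str      # innermost error text, e.g. "invalid size"


# Pattern inferred from the log text in this report only (an assumption).
PATTERN = re.compile(
    r"downsample block (?P<block>[0-9A-HJKMNP-TV-Z]{26}) to window (?P<window>\d+): "
    r"downsample aggregate block, series: (?P<series>\d+): (?P<cause>[^\n\\]+)"
)


def parse_downsample_failure(message: str) -> Optional[DownsampleFailure]:
    """Extract the failing block, window, series, and cause, or None if no match."""
    m = PATTERN.search(message)
    if m is None:
        return None
    return DownsampleFailure(
        block=m.group("block"),
        window_ms=int(m.group("window")),
        series=int(m.group("series")),
        cause=m.group("cause").strip(),
    )


if __name__ == "__main__":
    reason = (
        "error executing compaction: first pass of downsampling failed: "
        "downsampling to 60 min: downsample block 01H0R4SZD0FRV0WYF8DJT2FWHC "
        "to window 3600000: downsample aggregate block, series: 1536060: invalid size"
    )
    print(parse_downsample_failure(reason))
```

The cause group stops at a newline or a backslash so that the same pattern also works on the `err=` value from `main.go`, where the stack trace follows the message as escaped `\n` sequences.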

Anything else we need to know:

yeya24 commented 1 year ago

Hi @oleksiytsyban, thanks for the report. Did you enable native histograms in your Prometheus? I just want to double-check whether this was caused by something new.

oleksiytsyban commented 1 year ago

> Hi @oleksiytsyban, thanks for the report. Did you enable native histograms with your Prometheus?

Hi. No, we did not.