thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

[Thanos Compact] Downsampling issue on version 0.30 #6097

Open heliapb opened 1 year ago

heliapb commented 1 year ago

Thanos, Prometheus and Golang version used: Thanos: 0.30 Prometheus: 2.39.1

Object Storage Provider: Azure

What happened:

Hi, we just started using Thanos version 0.30 but had to roll back to 0.29 because of an issue with compact downsampling:

level=error ts=2023-02-02T09:12:03.497989324Z caller=main.go:161 err="downsampling to 5 min: download block 01GR8BWH3EKD3N898HCZ07DXBW: copy object to file: context deadline exceeded\nfirst pass of downsampling failed\nmain.runCompact.func7\n\t/app/cmd/thanos/compact.go:441\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:477\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:476\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\nerror executing compaction\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:504\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:476\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\ncompact command failed\nmain.main\n\t/app/cmd/thanos/main.go:161\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"

We decided to roll back after looking at https://github.com/thanos-io/thanos/issues/5272, which reported the same error and was fixed in 0.29. Somehow we hit it again with 0.30, so please take a look.

What you expected to happen: Version 0.30 should not have the downsampling issue, since it was fixed in 0.29.

How to reproduce it (as minimally and precisely as possible): Use the Thanos charts https://github.com/bitnami/charts/tree/main/bitnami/thanos with the 0.30 image version.

Full logs to relevant components:

Environment:

fpetkovski commented 1 year ago

Not sure if this is an issue with Thanos specifically. The error message seems to indicate a time out when trying to download a file from object storage:

copy object to file: context deadline exceeded

heliapb commented 1 year ago

Hi there, that seems unlikely, since after rolling back to the previous version (0.29) we stopped having this issue.

yeya24 commented 1 year ago

@heliapb If you go back to v0.30 again, can you reproduce this issue consistently? To me this looks like a network connection issue.

heliapb commented 1 year ago

Hi, I tested with version 0.30.2 and got the same issue:

level=error ts=2023-02-28T09:32:38.029414631Z caller=main.go:161 err="downsampling to 5 min: download block 01GTB99GT0VX7C8NA57RPCRS0H: copy object to file: context deadline exceeded\nfirst pass of downsampling failed\nmain.runCompact.func7\n\t/app/cmd/thanos/compact.go:441\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:477\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:476\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\nerror executing compaction\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:504\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:476\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\ncompact command failed\nmain.main\n\t/app/cmd/thanos/main.go:161\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"

We only changed to this version yesterday. Previously we ran version 0.29 for a month with no compact issues; it took just 24h on the new version to hit the same error.

glimberg commented 1 year ago

Also running into this issue. It happens in both 0.30.2 and 0.31.0. This is a new Thanos install on top of an existing Prometheus instance with ~1 year of data uploaded to a GCS bucket via the sidecar.

downsampling to 5 min: download block 01G5F08E37DQ62W1FXAA82KCF3: context canceled
first pass of downsampling failed
main.runCompact.func7
    /app/cmd/thanos/compact.go:441
main.runCompact.func8.1
    /app/cmd/thanos/compact.go:477
github.com/thanos-io/thanos/pkg/runutil.Repeat
    /app/pkg/runutil/runutil.go:74
main.runCompact.func8
    /app/cmd/thanos/compact.go:476
github.com/oklog/run.(*Group).Run.func1
    /go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1594
error executing compaction
main.runCompact.func8.1
    /app/cmd/thanos/compact.go:504
github.com/thanos-io/thanos/pkg/runutil.Repeat
    /app/pkg/runutil/runutil.go:74
main.runCompact.func8
    /app/cmd/thanos/compact.go:476
github.com/oklog/run.(*Group).Run.func1
    /go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1594
compact command failed
main.main
    /app/cmd/thanos/main.go:161
runtime.main
    /usr/local/go/src/runtime/proc.go:250
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1594

Block info causing the error:

01G5F08E37DQ62W1FXAA82KCF3
Start Time: May 23, 2022 11:00 PM
End Time: June 13, 2022 5:00 AM
Duration: 20 days
Series: 468753
Samples: 8128685939
Chunks: 67831539
Resolution: 0
Level: 6
Source: sidecar

glimberg commented 1 year ago

I think I figured this out, in my case at least. I deployed Thanos via the bitnami/thanos helm chart. In its configuration for the compactor, it provisions a persistent volume with a default size of 10Gi. In my case, some of these old blocks are 12Gi+, so they couldn't be downloaded because the PV wasn't large enough to hold them. I increased the PV size and now it's running correctly.
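
As a rough cross-check of the explanation above, the block's on-disk size can be estimated from the sample count in the posted block info. This is only a back-of-the-envelope sketch: the 1 to 2 bytes per sample range is the commonly quoted Prometheus average and is an assumption here, index and metadata overhead are ignored, and `estimate_block_gib` is a hypothetical helper, not part of Thanos.

```python
# Rough block-size estimate from the sample count in the block info.
# ASSUMPTION: 1-2 bytes per compressed sample (commonly quoted Prometheus
# average); index and metadata overhead are not included.

GIB = 1024 ** 3


def estimate_block_gib(samples: int, bytes_per_sample: float) -> float:
    """Return an approximate block size in GiB for the given sample count."""
    return samples * bytes_per_sample / GIB


samples = 8_128_685_939  # "Samples" value from the block info in this thread
volume_gib = 10          # default compactor volume size reported for the chart

low = estimate_block_gib(samples, 1.0)
high = estimate_block_gib(samples, 2.0)
print(f"estimated block size: {low:.1f} to {high:.1f} GiB (volume: {volume_gib} GiB)")
```

Under these assumptions the estimate straddles the 10Gi default, which is consistent with a block of this size not fitting on the volume, especially since the compactor also needs working space for the downsampled output.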

glimberg commented 1 year ago

Better error messages for this case would be very helpful. Nothing in the error message above gives a hint that it's a storage space issue.

heliapb commented 1 year ago

Well, in our case we are working with PV sizes up to 500Gi because of the volume of metrics we handle, so we are not using the default Thanos PV sizes. We also use the bitnami charts.

glimberg commented 1 year ago

@heliapb Yeah, we had 500GB PVs for Prometheus itself. Older blocks were 13+GB each. The only thing I needed to change for this to work in our case was to bump up the PV size for the compactor itself. I bumped the compactor deployment's PV from the default 10GB up to 50GB and now have plenty of room for it to download & downsample the blocks. Can't say for certain if you're running into the exact same thing, but I'd suggest checking the allocated size of the PV assigned to the compactor deployment.
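
One way to act on the suggestion above is to compare the free space on the compactor's data volume with the size of the largest block. The sketch below is a minimal illustration using only the Python standard library; `has_room`, `check_data_dir` and the `headroom` factor of 2.0 are hypothetical and chosen for illustration (room for the download plus the downsampled output), not a documented Thanos requirement.

```python
import shutil

GIB = 1024 ** 3


def has_room(block_bytes: int, free_bytes: int, headroom: float = 2.0) -> bool:
    """Return True if free space covers the block plus working space.

    headroom is an assumed safety factor, not a value taken from Thanos.
    """
    return free_bytes >= block_bytes * headroom


def check_data_dir(data_dir: str, block_bytes: int) -> bool:
    """Check the filesystem that holds data_dir against one block size."""
    free = shutil.disk_usage(data_dir).free
    return has_room(block_bytes, free)


# Example: a 13 GiB block against the filesystem holding the current directory.
# In a real check, pass the compactor's data directory instead of ".".
print(check_data_dir(".", 13 * GIB))
```

Run inside the compactor pod, a `False` result would point to the same situation described in this thread: the volume cannot hold the block being downloaded.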

heliapb commented 1 year ago

@glimberg The compactor's PV is 500Gi, not Prometheus's. We keep the Prometheus volume very small because we work with 4h retention, and we use custom values for all our Thanos infrastructure.

BouchaaraAdil commented 1 month ago

We are hitting:

ts=2024-09-10T19:03:35.270631673Z caller=compact.go:546 level=error msg="retriable error" err="first pass of downsampling failed: 6 errors: downsampling to 5 min: download block 01J4869K764KGS962QA5YMAJJFF: context canceled; downsampling to 5 min: download block 01J4866MSTNHVSV1MDNB9ZP23FE: context canceled; downsampling to 5 min: download block 01J6M36MGSNXRT2BK6KZYZDTZ3: context canceled; downsampling to 5 min: download block 01J00SBZ3XMYJ3N0RM1854940495: context canceled; downsampling to 5 min: download block 01J6M9Q9175759DEWQKSD0ANN: context canceled; downsampling to 60 min: download block 01J7EEA1WJ7PE0697971V96V746V: get file 01J7EE34567PERM971V96V746V/index: The difference between the request time and the current time is too large.

The size of blocks Loki failed to download is bigger than 20 GiB
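
When a single log line reports several failed blocks, as in the errors posted in this thread, it can help to pull out each block ID and its failure reason before checking block sizes in the bucket. The following is a minimal sketch; `failed_blocks` is a hypothetical helper, and the `sample` string is a shortened composite built from block IDs quoted earlier in this thread, not a real log line.

```python
import re

# Matches "download block <ID>: <reason>", where the reason runs up to the
# next ";" separator, a literal "\n" escape, a real newline, or end of string.
BLOCK_ERR = re.compile(r"download block (\w+): (.*?)(?=;|\\n|\n|$)")


def failed_blocks(message: str) -> list[tuple[str, str]]:
    """Return (block_id, reason) pairs found in a compactor error message."""
    return [(m.group(1), m.group(2).strip()) for m in BLOCK_ERR.finditer(message)]


# Composite sample for illustration only (not a verbatim log line).
sample = (
    "first pass of downsampling failed: 2 errors: "
    "downsampling to 5 min: download block 01G5F08E37DQ62W1FXAA82KCF3: context canceled; "
    "downsampling to 5 min: download block 01GR8BWH3EKD3N898HCZ07DXBW: "
    "copy object to file: context deadline exceeded"
)

for block_id, reason in failed_blocks(sample):
    print(block_id, "->", reason)
```

Grouping the reasons this way separates timeouts (`context deadline exceeded`) from cancellations (`context canceled`) and from storage-side errors such as the request-time message in the log above.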