thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Thanos 504 gateway timeout when accessing S3 object storage #5298

Open rrraditya opened 2 years ago

rrraditya commented 2 years ago

Thanos, Prometheus and Golang version used:

/bin/thanos --version
thanos, version 0.17.2 (branch: HEAD, revision: 37e6ef61566c7c70793ba6d128f00c4c66cb2402)
  build user: root@92283ccb0bc0
  build date: 20201208-10:00:57
  go version: go1.15
  platform: linux/amd64

/bin/prometheus --version
prometheus, version 2.19.1 (branch: HEAD, revision: eba3fdcbf0d378b66600281903e3aab515732b39)
  build user: root@62700b3d0ef9
  build date: 20200618-16:35:26
  go version: go1.14.4

Object Storage Provider: Internal Cloudian S3

What happened: Thanos compact stopped running. When we tried to check whether there are any overlapping blocks, bucket verify failed with a timeout, and Thanos store also times out with the same error.

What you expected to happen: bucket verify lists all overlapping blocks without timing out. Can we adjust the timeout for the block metadata fetch? I couldn't find a suitable option in the documentation.

Full logs to relevant components:

level=info ts=2022-04-25T10:04:16.761634282Z caller=main.go:98 msg="Tracing will be disabled"
level=info ts=2022-04-25T10:04:16.761760401Z caller=factory.go:46 msg="loading bucket configuration"
level=info ts=2022-04-25T10:04:16.763511746Z caller=verify.go:130 verifiers=overlapped_blocks,index_known_issues msg="Starting verify task"
level=info ts=2022-04-25T10:04:16.763547942Z caller=overlapped_blocks.go:29 verifiers=overlapped_blocks,index_known_issues verifier=overlapped_blocks msg="started verifying issue"
level=error ts=2022-04-25T10:27:34.614070057Z caller=main.go:131 err="504 Gateway Time-out\nBaseFetcher: iter bucket\ngithub.com/thanos-io/thanos/pkg/block.(BaseFetcher).fetchMetadata\n\t/app/pkg/block/fetcher.go:359\ngithub.com/thanos-io/thanos/pkg/block.(BaseFetcher).fetch.func2\n\t/app/pkg/block/fetcher.go:420\ngithub.com/golang/groupcache/singleflight.(Group).Do\n\t/go/pkg/mod/github.com/golang/groupcache@v0.0.0-20200121045136-8c9f03a8e57e/singleflight/singleflight.go:56\ngithub.com/thanos-io/thanos/pkg/block.(BaseFetcher).fetch\n\t/app/pkg/block/fetcher.go:418\ngithub.com/thanos-io/thanos/pkg/block.(MetaFetcher).Fetch\n\t/app/pkg/block/fetcher.go:479\ngithub.com/thanos-io/thanos/pkg/verifier.fetchOverlaps\n\t/app/pkg/verifier/overlapped_blocks.go:48\ngithub.com/thanos-io/thanos/pkg/verifier.OverlappedBlocksIssue.Verify\n\t/app/pkg/verifier/overlapped_blocks.go:31\ngithub.com/thanos-io/thanos/pkg/verifier.(Manager).Verify\n\t/app/pkg/verifier/verify.go:135\nmain.registerBucketVerify.func1\n\t/app/cmd/thanos/tools_bucket.go:173\nmain.main\n\t/app/cmd/thanos/main.go:129\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374\nfetch overlaps\ngithub.com/thanos-io/thanos/pkg/verifier.OverlappedBlocksIssue.Verify\n\t/app/pkg/verifier/overlapped_blocks.go:33\ngithub.com/thanos-io/thanos/pkg/verifier.(Manager).Verify\n\t/app/pkg/verifier/verify.go:135\nmain.registerBucketVerify.func1\n\t/app/cmd/thanos/tools_bucket.go:173\nmain.main\n\t/app/cmd/thanos/main.go:129\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374\nverify overlapped_blocks\ngithub.com/thanos-io/thanos/pkg/verifier.(Manager).Verify\n\t/app/pkg/verifier/verify.go:136\nmain.registerBucketVerify.func1\n\t/app/cmd/thanos/tools_bucket.go:173\nmain.main\n\t/app/cmd/thanos/main.go:129\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374\npreparing tools bucket verify command failed\nmain.main\n\t/app/cmd/thanos/main.go:131\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"

Anything else we need to know:

Environment:

cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)

uname -a
Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

wiardvanrij commented 2 years ago

Thanks for the report. I'm not aware that we made many changes in that area, but could you upgrade from 0.17.2 to the latest release and test whether that makes any difference?

rrraditya commented 2 years ago

Hi @wiardvanrij, I will give it a try and get back to you with the result. Thank you.

rrraditya commented 2 years ago

Hi all,

After some investigation, we found that the 504 gateway timeout is returned by the S3 load balancer, because listing the objects takes too long. The bucket contains a very large number of objects.

Can we manually delete some of the Prometheus block data? Will deleting old blocks manually affect other Thanos components? We cannot rely on the Thanos compactor to delete old data based on retention, because thanos compact also times out when it queries the object storage.

Thank you.

rrraditya commented 2 years ago

Hi @wiardvanrij, I hope you can find this ticket again and help us with these questions. Thank you.

stale[bot] commented 2 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.