thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Thanos Store - GroupCache error on deletion-mark.json #5265

Open nicolastakashi opened 2 years ago

nicolastakashi commented 2 years ago

Thanos, Prometheus and Golang version used:

Object Storage Provider: Azure

What happened: With the Thanos Store groupcache feature enabled, the logs fill up with errors like the following.

level=error ts=2022-04-04T09:55:57.399576869Z caller=groupcache.go:272 msg="failed fetching data from groupcache" err="X-Ms-Error-Code: [BlobNotFound]" key=content:01FZFAWHCKHA9J30RJ2KA7AWZ8/deletion-mark.json

What you expected to happen: These errors should not appear in the logs.

How to reproduce it (as minimally and precisely as possible): Just enable the groupcache feature.

Full logs to relevant components:

level=error ts=2022-04-04T09:55:57.399576869Z caller=groupcache.go:272 msg="failed fetching data from groupcache" err="X-Ms-Error-Code: [BlobNotFound]" key=content:01FZFAWHCKHA9J30RJ2KA7AWZ8/deletion-mark.json

Anything else we need to know: N/A

GiedriusS commented 2 years ago

The problem is a bit bigger - there are object storage operation "failures" coming from this, leading to false alerts. Need to think about how to solve this.

stale[bot] commented 2 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

xiaonanshen-ponyai commented 2 years ago

It seems groupcache also increases the number of get operations against the object storage, and almost all of them are failures (I believe all of them are for the deletion mark).

dragoangel commented 1 year ago

When I enabled groupcache, I almost instantly hit this alert: (sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-store.*"}[5m])) / sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-store.*"}[5m])) * 100 > 5). The value quickly climbed to almost 100%, whereas without groupcache there were no errors. The Thanos version is 0.31.0.
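To make the arithmetic behind that alert concrete, here is a small sketch of the same ratio: failures divided by operations, times 100, compared against the threshold of 5. The function name `failurePercent` and the sample rates are hypothetical and are not taken from any real deployment.

```go
package main

import "fmt"

// failurePercent mirrors the shape of the alert expression:
// failures / operations * 100. ok is false when there were no
// operations, in which case the ratio is undefined.
func failurePercent(failureRate, operationRate float64) (pct float64, ok bool) {
	if operationRate == 0 {
		return 0, false
	}
	return failureRate / operationRate * 100, true
}

func main() {
	const threshold = 5.0 // same threshold as the alert expression

	// Hypothetical per-second rates: nearly every operation is a failed
	// lookup of a missing deletion mark.
	pct, ok := failurePercent(49, 50)
	fmt.Println("alert fires:", ok && pct > threshold)
}
```

If most get operations are lookups for deletion marks that do not exist, the numerator tracks the denominator almost one to one, which would match the near 100% value reported here.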