thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

objstore: Azure: unusually high number of GetBlobProperties calls with ClientOtherError/404 responses #6412

Open · thewisenerd opened this issue 1 year ago

thewisenerd commented 1 year ago

Thanos, Prometheus and Golang version used: thanos=0.26.0, quay.io images

Object Storage Provider: Azure

What happened: an ever-increasing number of GetBlobProperties calls month-over-month, with most of them (95%+) resulting in ClientOtherError/404 responses

What you expected to happen: a considerably lower number of GetBlobProperties calls.

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:

level=info ts=2023-06-04T14:03:08.553692356Z caller=clean.go:49 msg="found partially uploaded block; marking for deletion" block=01GXX4BC14YSF36NA9G2210XAE
level=info ts=2023-06-04T14:03:08.658578215Z caller=clean.go:59 msg="deleted aborted partial upload" block=01GXX4BC14YSF36NA9G2210XAE thresholdAge=48h0m0s

Anything else we need to know:

The notes from our internal investigation:

thewisenerd commented 1 year ago

I realize the Thanos version is quite old, and that the objstore module split and SDK upgrade (0.29.0+) have happened since; however, please do not ask me to upgrade to 0.29.0 and check whether that fixes the issue, as we are not in a position to do that currently.

I can attempt to set up Thanos locally and see if I can reproduce the issue on 0.29.0+, but no guarantees on when I can get back with the results.

ahurtaud commented 1 year ago

Hello, we are not using "hierarchical namespace" here, and we are on the latest Thanos version. The block ULIDs get removed properly by the compactor; however, I think the 404 is the only way the compactor can check whether deletion-mark.json exists for each block ULID.
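
To make that mechanism concrete, here is a minimal hand-written sketch (not the actual Thanos code) of the pattern described above, assuming the thanos-io/objstore Bucket interface; the block ULID is taken from the logs earlier in this issue. With the Azure provider, each such existence lookup maps to a GetBlobProperties request on the wire, and a block that is not marked for deletion answers with a 404, which Azure metrics count under ClientOtherError:

```go
package main

import (
	"context"
	"fmt"
	"path"

	"github.com/thanos-io/objstore"
)

// checkDeletionMarks is a rough illustration only: for every known block ULID,
// find out whether <ulid>/deletion-mark.json exists. On Azure, each lookup is a
// GetBlobProperties call; a block without a deletion mark produces a 404.
func checkDeletionMarks(ctx context.Context, bkt objstore.Bucket, blockIDs []string) {
	for _, id := range blockIDs {
		marked, err := bkt.Exists(ctx, path.Join(id, "deletion-mark.json"))
		if err != nil {
			fmt.Printf("block %s: lookup failed: %v\n", id, err)
			continue
		}
		// marked == false is the common case and corresponds to the 404 responses.
		fmt.Printf("block %s: marked for deletion=%v\n", id, marked)
	}
}

func main() {
	// In-memory bucket just to keep the sketch runnable; a real setup would use
	// the Azure provider from github.com/thanos-io/objstore/providers/azure.
	bkt := objstore.NewInMemBucket()
	checkDeletionMarks(context.Background(), bkt, []string{"01GXX4BC14YSF36NA9G2210XAE"})
}
```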

[Screenshot: 2023-06-05 at 12:35:14]

We are not considering this an error. Also, this issue should be moved to https://github.com/thanos-io/objstore

bck01215 commented 9 months ago

We also get these errors, at an apparently higher rate. We are on the latest Thanos version, 0.32.5.

These errors have been around since before I came on board, but never caused a noticeable impact to my knowledge.

thewisenerd commented 9 months ago

@bck01215 any comment on whether the Azure storage account has "Hierarchical namespace" enabled?

bck01215 commented 9 months ago

@bck01215 any comment on whether the Azure storage account has "Hierarchical namespace" enabled?

That is a difference. I do not have that enabled.

Tiduster commented 8 months ago

Hi all,

@bck01215: we have the same issue, currently with 3.8M+ calls per month on Azure Storage. This is costing us almost €1k/month just in "GetBlobProperties" API calls.

"never caused a noticeable impact to my knowledge"

Can you look at your storage cost to see if you have the same issue?

Best regards,

bck01215 commented 8 months ago

@bck01215: we have the same issue, currently with 3.8M+ calls per month on Azure Storage. This is costing us almost €1k/month just in "GetBlobProperties" API calls.

"never caused a noticeable impact to my knowledge"

Can you look at your storage cost to see if you have the same issue?

@Tiduster Unfortunately, I don't have access to our billing info. We're in the process of migrating to an on-prem S3 server. Before this, I tried to increase the timeouts in the HTTP configs, but that did not resolve the issue.

bck01215 commented 8 months ago

After reaching out to the billing team, I confirmed the failed requests are affecting our billing. I was also able to confirm the source is block.BaseFetcher (caller=fetcher.go:487). This only seems to be occurring from the store (every 3 minutes) and the compactor.

After turning on verbose logging, it looks like all the requests go to deletion-mark.json. This seems to confirm what @ahurtaud is saying. The downside is a huge uptick in costs due to failed requests. I am unsure why my failure rate would be so much higher than his, however. Perhaps he scheduled his compactor to run less frequently?
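
A back-of-envelope estimate (purely hypothetical numbers) shows how quickly this adds up: if each block needs one deletion-mark.json lookup per sync, 1,000 blocks synced every 3 minutes is 1,000 × (24 × 60 / 3) = 480,000 lookups per day, roughly 14M per month from the store gateway alone, before the compactor is counted. To confirm which object paths are responsible on a given deployment, one rough option is to wrap the bucket and count lookups that target deletion-mark.json; the sketch below assumes the thanos-io/objstore Bucket interface and is not part of Thanos itself:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"
	"sync/atomic"

	"github.com/thanos-io/objstore"
)

// markerCountingBucket wraps any objstore.Bucket and counts how many Exists/Get
// calls target a deletion-mark.json object. Embedding the interface means all
// other Bucket methods pass through unchanged.
type markerCountingBucket struct {
	objstore.Bucket
	markerLookups atomic.Int64
}

func (b *markerCountingBucket) Exists(ctx context.Context, name string) (bool, error) {
	if strings.HasSuffix(name, "deletion-mark.json") {
		b.markerLookups.Add(1)
	}
	return b.Bucket.Exists(ctx, name)
}

func (b *markerCountingBucket) Get(ctx context.Context, name string) (io.ReadCloser, error) {
	if strings.HasSuffix(name, "deletion-mark.json") {
		b.markerLookups.Add(1)
	}
	return b.Bucket.Get(ctx, name)
}

func main() {
	bkt := &markerCountingBucket{Bucket: objstore.NewInMemBucket()}
	// One lookup against a marker that does not exist (the 404 case on Azure).
	_, _ = bkt.Exists(context.Background(), "01GXX4BC14YSF36NA9G2210XAE/deletion-mark.json")
	fmt.Println("deletion-mark lookups:", bkt.markerLookups.Load())
}
```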

6fears7 commented 8 months ago

It seems like #2565 explored the 404s related to deletion-mark.json as well, with users pointing to Azure's internal library handling of the error notification, as seen here in Azure's SDK, though this addresses the symptom and not the underlying issue of the very high number of GetBlobProperties calls for deletion markers.
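
For reference, the snippet below is a hand-written sketch (not the objstore implementation) of how that error handling typically looks with the current Azure Go SDK (azure-sdk-for-go/sdk/storage/azblob): an existence check issues GetProperties, which is GetBlobProperties on the wire, and a missing blob surfaces as a BlobNotFound/404 error that the caller translates into "does not exist" rather than a failure. The URL in main is hypothetical and only there to make the example compile and run:

```go
package main

import (
	"context"
	"fmt"

	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/blob"
	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/bloberror"
)

// markerExists checks a single blob via GetProperties and maps the 404 /
// BlobNotFound error onto (false, nil) instead of treating it as a failure.
// This 404 is what Azure metrics report as ClientOtherError.
func markerExists(ctx context.Context, bc *blob.Client) (bool, error) {
	_, err := bc.GetProperties(ctx, nil)
	switch {
	case err == nil:
		return true, nil
	case bloberror.HasCode(err, bloberror.BlobNotFound):
		return false, nil
	default:
		return false, err
	}
}

func main() {
	// Hypothetical account/container/blob path purely for illustration; real code
	// would use proper credentials and configuration.
	bc, err := blob.NewClientWithNoCredential(
		"https://example.blob.core.windows.net/thanos/01GXX4BC14YSF36NA9G2210XAE/deletion-mark.json", nil)
	if err != nil {
		panic(err)
	}
	ok, err := markerExists(context.Background(), bc)
	fmt.Println(ok, err)
}
```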

Tiduster commented 8 months ago

Thank you very much, @bck01215, for verifying your cost figures. We have experienced an exponential cost increase over the past few months on our end.

Here's what we've done: