thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

objstore: Azure: unusually high number of GetBlobProperties calls with ClientOtherError/404 responses #6412

Open · thewisenerd opened this issue 1 year ago

thewisenerd commented 1 year ago

Thanos, Prometheus and Golang version used: thanos=0.26.0, quay.io images

Object Storage Provider: Azure

What happened: an ever-increasing number of GetBlobProperties calls month-over-month, with most of them (95%+) resulting in ClientOtherError/404 responses

What you expected to happen: a considerably lower number of GetBlobProperties calls.

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:

level=info ts=2023-06-04T14:03:08.553692356Z caller=clean.go:49 msg="found partially uploaded block; marking for deletion" block=01GXX4BC14YSF36NA9G2210XAE
level=info ts=2023-06-04T14:03:08.658578215Z caller=clean.go:59 msg="deleted aborted partial upload" block=01GXX4BC14YSF36NA9G2210XAE thresholdAge=48h0m0s

Anything else we need to know:

The notes from our internal investigation:

thewisenerd commented 1 year ago

I realize the Thanos version is quite old, and that the objstore module split and SDK upgrade (0.29.0+) have happened since; however, please do not ask me to upgrade to 0.29.0 and check whether that fixes the issue, as we are not in a position to do that currently.

I can attempt to set up Thanos locally and see if I can reproduce the issue on 0.29.0+, but no guarantees on when I can get back with the results.

ahurtaud commented 1 year ago

Hello, we are not using "hierarchical namespace" here, and we are on the latest Thanos version. The block ULIDs get removed properly by the compactor; however, I think the 404 is the only way the compactor can check whether deletion-mark.json exists for each block ULID.
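
To make that mechanism concrete, here is a minimal hand-written sketch (not the actual Thanos code) of the pattern described above, assuming the thanos-io/objstore Bucket interface; the block ULID is taken from the logs earlier in this issue. With the Azure provider, each such existence lookup maps to a GetBlobProperties request on the wire, and a block that is not marked for deletion answers with a 404, which Azure metrics count under ClientOtherError:

```go
package main

import (
	"context"
	"fmt"
	"path"

	"github.com/thanos-io/objstore"
)

// checkDeletionMarks is a rough illustration only: for every known block ULID,
// find out whether <ulid>/deletion-mark.json exists. On Azure, each lookup is a
// GetBlobProperties call; a block without a deletion mark produces a 404.
func checkDeletionMarks(ctx context.Context, bkt objstore.Bucket, blockIDs []string) {
	for _, id := range blockIDs {
		marked, err := bkt.Exists(ctx, path.Join(id, "deletion-mark.json"))
		if err != nil {
			fmt.Printf("block %s: lookup failed: %v\n", id, err)
			continue
		}
		// marked == false is the common case and corresponds to the 404 responses.
		fmt.Printf("block %s: marked for deletion=%v\n", id, marked)
	}
}

func main() {
	// In-memory bucket just to keep the sketch runnable; a real setup would use
	// the Azure provider from github.com/thanos-io/objstore/providers/azure.
	bkt := objstore.NewInMemBucket()
	checkDeletionMarks(context.Background(), bkt, []string{"01GXX4BC14YSF36NA9G2210XAE"})
}
```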

[Screenshot: 2023-06-05 at 12:35:14]

We are not considering this an error. Also, this issue should be moved to https://github.com/thanos-io/objstore

bck01215 commented 9 months ago

We also get these errors, at an apparently higher rate. We are on the latest Thanos version, 0.32.5.

These errors have been around since before I came on board, but never caused a noticeable impact to my knowledge.

thewisenerd commented 9 months ago

@bck01215 any comment on whether the Azure storage account has "Hierarchical namespace" enabled?

bck01215 commented 9 months ago

@bck01215 any comment on whether the Azure storage account has "Hierarchical namespace" enabled?

That is a difference. I do not have that enabled.

Tiduster commented 8 months ago

Hi all,

@bck01215: we have the same issue, currently with 3.8M+ calls per month on Azure Storage. This is costing us almost €1k/month just in "GetBlobProperties" API calls.

"never caused a noticeable impact to my knowledge"

Can you look at your storage cost to see if you have the same issue?

Best regards,

bck01215 commented 8 months ago

@bck01215: we have the same issue, currently with 3.8M+ calls per month on Azure Storage. This is costing us almost €1k/month just in "GetBlobProperties" API calls.

"never caused a noticeable impact to my knowledge"

Can you look at your storage cost to see if you have the same issue?

@Tiduster Unfortunately, I don't have access to our billing info. We're in the process of migrating to an on-prem S3 server. Before this, I tried to increase the timeouts in the HTTP configs, but that did not resolve the issue.

bck01215 commented 8 months ago

After reaching out to the billing team, I confirmed the failed requests are affecting our billing. I was also able to confirm the source is block.BaseFetcher (caller=fetcher.go:487). This only seems to be occurring from the store (every 3 minutes) and the compactor.

After turning on verbose logging, it looks like all the requests go to deletion-mark.json. This seems to confirm what @ahurtaud is saying. The downside is a huge uptick in costs due to failed requests. I am unsure why my failure rate would be so much higher than his, however. Perhaps he scheduled his compactor to run less frequently?
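
A back-of-envelope estimate (purely hypothetical numbers) shows how quickly this adds up: if each block needs one deletion-mark.json lookup per sync, 1,000 blocks synced every 3 minutes is 1,000 × (24 × 60 / 3) = 480,000 lookups per day, roughly 14M per month from the store gateway alone, before the compactor is counted. To confirm which object paths are responsible on a given deployment, one rough option is to wrap the bucket and count lookups that target deletion-mark.json; the sketch below assumes the thanos-io/objstore Bucket interface and is not part of Thanos itself:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"
	"sync/atomic"

	"github.com/thanos-io/objstore"
)

// markerCountingBucket wraps any objstore.Bucket and counts how many Exists/Get
// calls target a deletion-mark.json object. Embedding the interface means all
// other Bucket methods pass through unchanged.
type markerCountingBucket struct {
	objstore.Bucket
	markerLookups atomic.Int64
}

func (b *markerCountingBucket) Exists(ctx context.Context, name string) (bool, error) {
	if strings.HasSuffix(name, "deletion-mark.json") {
		b.markerLookups.Add(1)
	}
	return b.Bucket.Exists(ctx, name)
}

func (b *markerCountingBucket) Get(ctx context.Context, name string) (io.ReadCloser, error) {
	if strings.HasSuffix(name, "deletion-mark.json") {
		b.markerLookups.Add(1)
	}
	return b.Bucket.Get(ctx, name)
}

func main() {
	bkt := &markerCountingBucket{Bucket: objstore.NewInMemBucket()}
	// One lookup against a marker that does not exist (the 404 case on Azure).
	_, _ = bkt.Exists(context.Background(), "01GXX4BC14YSF36NA9G2210XAE/deletion-mark.json")
	fmt.Println("deletion-mark lookups:", bkt.markerLookups.Load())
}
```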

6fears7 commented 8 months ago

It seems like #2565 explored the 404s related to deletion-mark.json as well, with users pointing to Azure's internal library handling of the error notification, as seen here in Azure's SDK, though this addresses the symptom and not the underlying issue of the very high number of GetBlobProperties calls for deletion markers.
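
For reference, the snippet below is a hand-written sketch (not the objstore implementation) of how that error handling typically looks with the current Azure Go SDK (azure-sdk-for-go/sdk/storage/azblob): an existence check issues GetProperties, which is GetBlobProperties on the wire, and a missing blob surfaces as a BlobNotFound/404 error that the caller translates into "does not exist" rather than a failure. The URL in main is hypothetical and only there to make the example compile and run:

```go
package main

import (
	"context"
	"fmt"

	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/blob"
	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/bloberror"
)

// markerExists checks a single blob via GetProperties and maps the 404 /
// BlobNotFound error onto (false, nil) instead of treating it as a failure.
// This 404 is what Azure metrics report as ClientOtherError.
func markerExists(ctx context.Context, bc *blob.Client) (bool, error) {
	_, err := bc.GetProperties(ctx, nil)
	switch {
	case err == nil:
		return true, nil
	case bloberror.HasCode(err, bloberror.BlobNotFound):
		return false, nil
	default:
		return false, err
	}
}

func main() {
	// Hypothetical account/container/blob path purely for illustration; real code
	// would use proper credentials and configuration.
	bc, err := blob.NewClientWithNoCredential(
		"https://example.blob.core.windows.net/thanos/01GXX4BC14YSF36NA9G2210XAE/deletion-mark.json", nil)
	if err != nil {
		panic(err)
	}
	ok, err := markerExists(context.Background(), bc)
	fmt.Println(ok, err)
}
```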

Tiduster commented 8 months ago

Thank you very much, @bck01215, for verifying your cost figures. We have experienced an exponential cost increase over the past few months on our end.

Here's what we've done: