thewisenerd opened 1 year ago
I realize the Thanos version is quite old, and the objstore
module split and SDK upgrade (0.29.0+) have happened since; however, please do not ask me to upgrade to 0.29.0 to check whether that fixes the issue, as we are not in a position to do that currently.
I can attempt to set up Thanos locally and see if I can reproduce the issue on 0.29.0+, but no guarantees on when I can get back with the results.
Hello, we are not using "hierarchical namespace" here, and we are on the latest Thanos version. The block ULIDs get removed properly by the compactor; however, I think the 404 is the only way the compactor can check whether `deletion-mark.json` exists for each block ULID.
We are not considering this an error. Also, this issue should be moved to https://github.com/thanos-io/objstore
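For illustration, here is a minimal sketch of what such an existence check looks like. It assumes a simplified subset of the thanos-io/objstore `Bucket` interface and is not the actual Thanos code; the point is that for any block without a deletion mark (the common case), the probe necessarily comes back from Azure as a 404:

```go
// Simplified sketch of the per-block deletion-mark existence probe.
// Bucket mirrors a small subset of the thanos-io/objstore Bucket interface.
package sketch

import (
	"context"
	"fmt"
	"io"
)

type Bucket interface {
	// Get opens the object; on Azure a missing object surfaces as a 404.
	Get(ctx context.Context, name string) (io.ReadCloser, error)
	// IsObjNotFoundErr reports whether err means "object does not exist".
	IsObjNotFoundErr(err error) bool
}

// hasDeletionMark checks whether <ulid>/deletion-mark.json exists.
// Blocks that are not marked for deletion always produce a not-found
// response, which the Azure SDK reports (and logs) as a 404.
func hasDeletionMark(ctx context.Context, bkt Bucket, ulid string) (bool, error) {
	rc, err := bkt.Get(ctx, ulid+"/deletion-mark.json")
	if err != nil {
		if bkt.IsObjNotFoundErr(err) {
			return false, nil // expected: most blocks have no deletion mark
		}
		return false, fmt.Errorf("read deletion mark: %w", err)
	}
	defer rc.Close()
	return true, nil
}
```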
We also get these errors, at an apparently higher rate. We are on the latest Thanos version, 0.32.5.
These errors have been around since before I came on board, but have never caused a noticeable impact to my knowledge.
@bck01215 any comment on whether the Azure storage account has "Hierarchical namespace" enabled?
That is a difference. I do not have that enabled
Hi all,
@bck01215: we have the same issue, currently with 3.8M+ calls per month to Azure Storage. This is costing us almost 1k€ per month just in "GetBlobProperties" API calls.
"never caused a noticeable impact to my knowledge"
Can you look at your storage cost to see if you have the same issue?
Best regards,
@Tiduster Unfortunately, I don't have access to our billing info. We're in the process of migrating to an on-prem S3 server. Before this, I tried increasing the timeouts in the HTTP configs, but that did not resolve the issue.
After reaching out to the billing team, I confirmed the failed requests are affecting our billing. I was also able to confirm the source is `block.BaseFetcher` (`caller=fetcher.go:487`). This only seems to be occurring from the store (every 3 minutes) and the compactor.

After turning on verbose logging, it looks like all the requests go to the `deletion-mark.json` endpoint. This seems to confirm what @ahurtaud is saying. The downside is a huge uptick in costs due to the failed requests. I'm unsure why my rate of failures would be so much higher than his, however; perhaps he scheduled his compactor to run less frequently?
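For a rough sense of scale (illustrative numbers, not figures taken from this thread): a store gateway that syncs blocks every 3 minutes performs 480 syncs per day, so a bucket holding, say, 260 blocks would trigger roughly 260 × 480 ≈ 125k `deletion-mark.json` existence probes per day, or about 3.7M per month, from the store alone, before counting the compactor. Since most blocks carry no deletion mark, nearly all of those probes come back as billed 404 responses.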
It seems like #2565 explored the 404s related to `deletion-mark.json`
as well, with users pointing to Azure's internal library handling the error notification, as seen here in Azure's SDK; though this addresses the symptom and not the issue of very high GetBlob call volumes for deletion marks.
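As a hedged illustration of that "symptom" side: with the newer azcore-based Azure SDK modules (Thanos 0.26 still used the older azure-storage-blob-go client, so this is not the code path in that release), a 404 can be classified as a plain "not found" rather than an error, along these lines:

```go
// Sketch: treat an Azure SDK 404 as "object not found" instead of an error.
package sketch

import (
	"errors"
	"net/http"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
)

// isObjNotFoundErr reports whether err is an HTTP 404 returned by the Azure SDK.
func isObjNotFoundErr(err error) bool {
	var respErr *azcore.ResponseError
	if errors.As(err, &respErr) {
		return respErr.StatusCode == http.StatusNotFound
	}
	return false
}
```

This would only quiet the error reporting, though; the volume of `GetBlobProperties` calls, and their cost, would be unchanged.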
Thank you very much, @bck01215, for verifying your cost figures. We have experienced an exponential cost increase over the past few months on our end.
Here's what we've done:
Thanos, Prometheus and Golang version used: thanos=0.26.0, quay.io images

Object Storage Provider: Azure

What happened: an ever-increasing number of `GetBlobProperties` calls month-over-month, with most of them (95%+) resulting in `ClientOtherError`/404 responses.

What you expected to happen: a far lower number of `GetBlobProperties` calls.

How to reproduce it (as minimally and precisely as possible): recurring `found partially uploaded block` and `deleted partially uploaded block` log lines.

Full logs to relevant components:

Anything else we need to know: the notes from our internal investigation:
- `{ulid}/chunks/` and `{ulid}/` remain even after deletion of all files within `{ulid}/`
- on the next run, these blocks are again flagged by `BestEffortCleanAbortedPartialUploads` due to the missing `{ulid}/meta.json`
- `BestEffortCleanAbortedPartialUploads` is unable to delete these directories, since `deleteDirRec` invokes `(b *Bucket) Iter` and does not attempt deleting the directory itself (see the simplified sketch below)