Open grafanabot opened 3 years ago
Thank you for the detailed report. As you point out, this needs to be fixed in Thanos first. We can keep this issue open until we import the updated Thanos into Cortex.
Submitted by: pstibrany
Thanks for your report.
There's one thing I can't understand, though. You mentioned the compactor logged "uploaded file" for 01F16ZRT8TYA08VJQR1ZPCC5EP/index: this message is logged once the upload successfully completes (from the client's perspective), so, for the compactor, the index was successfully uploaded.
Submitted by: pracucci
From my understanding, an error happened in Upload(), but was only caught during the deferred CloseWithLogOnErr(). Upload() then falsely returned "success" to the compactor as it did not know that the Close() had failed. Based on the return value of Upload(), the compactor thinks the upload was a success, even though it failed.
Submitted by: ubcharron
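A minimal Go sketch of the failure mode described in the comment above. This is not the actual Thanos or Swift client code; failingCloser, uploadBuggy and uploadFixed are hypothetical names used only for illustration. It assumes a writer that accepts every write but only reports the failure on Close, and shows how a deferred Close that merely logs its error lets the function return success, whereas a named return value lets the deferred call propagate it.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// failingCloser is a hypothetical writer that accepts all writes but reports
// an error on Close, mimicking a client that only detects a failed upload
// when the stream is finalized.
type failingCloser struct{}

func (failingCloser) Write(p []byte) (int, error) { return len(p), nil }
func (failingCloser) Close() error                { return errors.New("upload not committed") }

// uploadBuggy only logs the error from the deferred Close, so the caller
// sees success even though the upload failed.
func uploadBuggy(w io.WriteCloser, data []byte) error {
	defer func() {
		if err := w.Close(); err != nil {
			fmt.Println("warning: close failed:", err) // logged, never returned
		}
	}()
	_, err := w.Write(data)
	return err
}

// uploadFixed uses a named return value so the deferred function can
// hand the Close error back to the caller.
func uploadFixed(w io.WriteCloser, data []byte) (err error) {
	defer func() {
		if cerr := w.Close(); cerr != nil && err == nil {
			err = fmt.Errorf("close writer: %w", cerr)
		}
	}()
	_, err = w.Write(data)
	return err
}

func main() {
	data := []byte("index")
	fmt.Println("buggy:", uploadBuggy(failingCloser{}, data))
	fmt.Println("fixed:", uploadFixed(failingCloser{}, data))
}
```

With the buggy variant, a caller that deletes source blocks after a nil error would delete them even though nothing was stored.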
this log gets logged once the upload successfully completes (from the client perspective) so, for the compactor, the index was successfully uploaded.
When using Swift, Upload returns success even if it fails when closing the writer :(
Submitted by: pstibrany
When using Swift, Upload returns success even if it fails when closing the writer
Oh, it's a Swift client issue! Let's fix it ;)
Submitted by: pracucci
This issue is resolved as of Thanos v0.21.0. Would it be possible to update the vendored Thanos package, or otherwise cherry-pick the PR? (https://github.com/thanos-io/thanos/pull/4218)
Submitted by: bcharron
Describe the bug
The compactor somehow failed to upload the block's index file to Swift, but still deleted the source blocks. There are warnings in the logs, but the compactor does not seem to be aware of them. We lost one day of metrics for our main tenant. (I was hoping to be able to re-generate the index file from the chunks, but that doesn't seem possible as the chunk files only have samples, not the labels themselves.)
We opened a bug in Thanos (https://github.com/thanos-io/thanos/issues/3958), but we're wondering if Cortex would be the more relevant place for it?
To Reproduce
We're not sure how it happens, so here's our best attempt at recollection:
Running Cortex 1.7.0, the Compactor compacted a series of blocks. It then uploaded all resulting files to Swift, but the index file never made it to Swift. In Swift's own logs, there is no trace of the index file ever being uploaded. We /think/ an error might have been detected by "CloseWithLogOnErr", but never made its way back to the Compactor (since it runs in a deferred call) and was thus ignored.
See logs below.
Expected behavior
The Compactor would retry sending a file if there is an error.
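The expected behavior could look roughly like the Go sketch below. It is an illustration only, not Cortex or Thanos code: retryUpload and errTransient are hypothetical names, and the attempt count and backoff are arbitrary. Note that retrying only helps once the upload function actually returns its errors; an error that is merely logged would still go unnoticed.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient is a hypothetical error used to simulate a failed upload.
var errTransient = errors.New("transient upload error")

// retryUpload is a hypothetical helper: it calls upload up to maxAttempts
// times, sleeping for backoff between failed attempts, and returns nil on
// the first success or the last error wrapped with the attempt count.
func retryUpload(maxAttempts int, backoff time.Duration, upload func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1 // always try at least once
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = upload(); err == nil {
			return nil
		}
		if attempt < maxAttempts {
			time.Sleep(backoff)
		}
	}
	return fmt.Errorf("upload failed after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// Simulated upload that fails twice, then succeeds.
	flaky := func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}
	err := retryUpload(5, 10*time.Millisecond, flaky)
	fmt.Println("calls:", calls, "err:", err)
}
```

In this sketch the caller would delete source blocks only when retryUpload returns nil for every file of the new block.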
Environment:
Storage Engine
Additional Context
Compactor logs:
Submitted by: ubcharron
Cortex Issue Number: 4055