grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Lost data during compaction on Swift #90

Open grafanabot opened 3 years ago

grafanabot commented 3 years ago

Describe the bug

The compactor somehow failed to upload the block's index file to Swift, but still deleted the source blocks. There are warnings in the logs, but the compactor does not seem to be aware of them. We lost one day of metrics for our main tenant. (I was hoping to be able to re-generate the index file from the chunks, but that doesn't seem possible as the chunk files only have samples, not the labels themselves.)

We opened a bug in Thanos (https://github.com/thanos-io/thanos/issues/3958), but we're wondering if Cortex would be the more relevant place for it?

To Reproduce

We're not sure how it happens, so here's our best attempt at recollection:

Running Cortex 1.7.0, the Compactor compacted a series of blocks. It then uploaded all resulting files to Swift, but the index file never made it to Swift. In Swift's own logs, there are no traces of the index file ever being uploaded. We /think/ an error might have been detected by "CloseWithLogOnErr" but, since it runs as a deferred call, the error never made its way back to the Compactor and was thus ignored.
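To illustrate the failure mode we suspect, here is a minimal, self-contained Go sketch. It is not the actual Thanos or Swift client code: swiftWriter, closeWithLogOnErr, and upload are placeholder names standing in for the real writer, for a log-only close helper in the spirit of CloseWithLogOnErr, and for the upload path. The key point is that the final flush happens in Close, Close is only reached through a deferred log-only helper, and upload therefore returns nil even though the object never reached the backend.

package main

import (
    "errors"
    "io"
    "log"
)

// swiftWriter is a stand-in for the object-store writer: data is buffered
// and only flushed to the backend when Close is called.
type swiftWriter struct{}

func (w *swiftWriter) Write(p []byte) (int, error) { return len(p), nil }

// Close performs the final flush; this is where the timeout would surface.
func (w *swiftWriter) Close() error {
    return errors.New("Timeout when reading or writing data")
}

// closeWithLogOnErr mimics a log-only close helper: the error is logged
// ("detected close error") but never returned to the caller.
func closeWithLogOnErr(c io.Closer) {
    if err := c.Close(); err != nil {
        log.Printf("level=warn msg=\"detected close error\" err=%q", err)
    }
}

// upload writes the object and defers the log-only close. Because the close
// error is swallowed, upload returns nil even though the object never
// reached the backend.
func upload(data []byte) error {
    w := &swiftWriter{}
    defer closeWithLogOnErr(w) // Close error is logged here, not propagated

    _, err := w.Write(data)
    return err // nil: the caller believes the upload succeeded
}

func main() {
    if err := upload([]byte("index")); err == nil {
        log.Println("upload reported success; the compactor would now mark source blocks for deletion")
    }
}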

See logs below.

Expected behavior

The Compactor should retry uploading a file if there is an error.
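For illustration only, a hypothetical sketch of what such a retry wrapper could look like (retryUpload, uploadOnce, the attempt count, and the fixed backoff are all assumptions, not Cortex's actual upload code). Note that a retry only helps once Upload actually reports the close failure, which is the underlying bug here.

package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

// retryUpload retries a failed upload a few times with a fixed backoff
// before giving up; it stops early if the context is cancelled.
func retryUpload(ctx context.Context, uploadOnce func(context.Context) error) error {
    const maxAttempts = 3
    var err error
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        if err = uploadOnce(ctx); err == nil {
            return nil
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(time.Second): // fixed backoff keeps the sketch short
        }
    }
    return err
}

func main() {
    failing := func(context.Context) error { return errors.New("upload object close: timeout") }
    fmt.Println(retryUpload(context.Background(), failing))
}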

Environment:

Storage Engine

Additional Context

Compactor logs:

{  
  "caller": "runutil.go:124",
  "err": "upload object close: Timeout when reading or writing data",
  "level": "warn",
  "msg": "detected close error",
  "ts": "2021-03-20T05:12:44.877771796Z"
}
{
  "bucket": "tracing: cortex-tsdb-prod04",
  "caller": "objstore.go:159",
  "component": "compactor",
  "dst": "01F16ZRT8TYA08VJQR1ZPCC5EP/index",
  "from": "data/compact/0@14583055817248146110/01F16ZRT8TYA08VJQR1ZPCC5EP/index",
  "group": "0@{__org_id__=\"1\"}",
  "groupKey": "0@14583055817248146110",
  "level": "debug",
  "msg": "uploaded file",
  "org_id": "1",
  "ts": "2021-03-20T05:12:44.877834603Z"
}
{
  "caller": "compact.go:810",
  "component": "compactor",
  "duration": "4m41.662527735s",
  "group": "0@{__org_id__=\"1\"}",
  "groupKey": "0@14583055817248146110",
  "level": "info",
  "msg": "uploaded block",
  "org_id": "1",
  "result_block": "01F16ZRT8TYA08VJQR1ZPCC5EP",
  "ts": "2021-03-20T05:12:45.140243007Z"
}
{
  "caller": "compact.go:832",
  "component": "compactor",
  "group": "0@{__org_id__=\"1\"}",
  "groupKey": "0@14583055817248146110",
  "level": "info",
  "msg": "marking compacted block for deletion",
  "old_block": "01F15H6D6CXE1ASE788HQECHM4",
  "org_id": "1",
  "ts": "2021-03-20T05:12:45.627586825Z"
}
$ openstack object list cortex-tsdb-prod04 --prefix 1/01F16ZRT8TYA08VJQR1ZPCC5EP
+--------------------------------------------+
| Name                                       |
+--------------------------------------------+
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000001 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000002 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000003 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000004 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000005 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000006 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000007 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000008 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000009 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000010 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000011 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000012 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000013 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000014 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000015 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000016 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000017 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000018 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000019 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000020 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000021 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000022 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000023 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/meta.json     |
+--------------------------------------------+

Submitted by: ubcharron
Cortex Issue Number: 4055

grafanabot commented 3 years ago

Thank you for the detailed report. As you point out, this needs to be fixed in Thanos first. We can keep this issue open until we import the updated Thanos into Cortex.

Submitted by: pstibrany

grafanabot commented 3 years ago

Thanks for your report.

There's one thing I can't understand, though. You mentioned the compactor logged "uploaded file" for 01F16ZRT8TYA08VJQR1ZPCC5EP/index: this message is logged once the upload successfully completes (from the client's perspective), so, as far as the compactor is concerned, the index was successfully uploaded.

Submitted by: pracucci

grafanabot commented 3 years ago

From my understanding, an error happened in Upload(), but it was only caught during the deferred CloseWithLogOnErr(). Since Upload() did not know that Close() had failed, it returned "success", and based on that return value the compactor treated the upload as successful even though it wasn't.

Submitted by: ubcharron

grafanabot commented 3 years ago

this message is logged once the upload successfully completes (from the client's perspective), so, as far as the compactor is concerned, the index was successfully uploaded.

When using Swift, Upload returns success even if it fails when closing the writer :(

Submitted by: pstibrany

grafanabot commented 3 years ago

When using Swift, Upload returns success even if it fails when closing the writer

Oh, it's a Swift client issue! Let's fix it ;)

Submitted by: pracucci
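For reference, the usual Go idiom for this kind of fix is to capture the deferred Close error into the function's named return value instead of only logging it. The sketch below is self-contained and hypothetical (flushOnCloseWriter and closeWithErrCapture are illustrative names, not the Swift client's or Thanos' actual code), but it shows the shape of the change:

package main

import (
    "errors"
    "fmt"
    "io"
)

// flushOnCloseWriter is a placeholder for an object-store writer whose final
// flush happens in Close (the behaviour discussed above for the Swift client).
type flushOnCloseWriter struct{}

func (w *flushOnCloseWriter) Write(p []byte) (int, error) { return len(p), nil }

func (w *flushOnCloseWriter) Close() error {
    return errors.New("Timeout when reading or writing data")
}

// closeWithErrCapture folds a Close error into the caller's named return
// value, so a failed final flush is no longer silently dropped.
func closeWithErrCapture(err *error, c io.Closer) {
    if cerr := c.Close(); cerr != nil && *err == nil {
        *err = cerr
    }
}

// upload now fails when the close (and therefore the final flush) fails.
func upload(data []byte) (err error) {
    w := &flushOnCloseWriter{}
    defer closeWithErrCapture(&err, w)

    _, err = w.Write(data)
    return err
}

func main() {
    // Prints the close error instead of <nil>, so a caller like the compactor
    // would not go on to delete the source blocks.
    fmt.Println(upload([]byte("index")))
}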

grafanabot commented 3 years ago

This issue is resolved as of Thanos v0.21.0. Would it be possible to update the Thanos vendor pkg, or otherwise cherry-pick the PR? (https://github.com/thanos-io/thanos/pull/4218)

Submitted by: bcharron