grafana/mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

MimirIngesterTSDBWALCorrupted: Distinguish between critical and non-critical cases #1321

Open · aknuds1 opened 2 years ago

aknuds1 commented 2 years ago

Is your feature request related to a problem? Please describe.

The MimirIngesterTSDBWALCorrupted alert pages because it has critical severity, even when the corruption is handled automatically by the affected ingester(s) and there's nothing for the paged engineer to do but investigate. The investigation is also usually mundane, as the typical cause is a Kubernetes pod having been terminated abruptly due to re-scheduling.

Describe the solution you'd like

We should consider whether it's possible to distinguish between critical and non-critical cases of MimirIngesterTSDBWALCorrupted, so that engineers only get paged for critical cases (which are hopefully rare and actually in need of human intervention).

Describe alternatives you've considered

@codesome has provided the following ideas:

Additional context

This alert seems to fire quite often: I was paged for it three times last week, and in every case only a single ingester was affected and the corruption was handled automatically.

Some context from @codesome on the MimirIngesterTSDBWALCorrupted alert:

It is probably good to investigate why the corruption is happening. While corruption in a few ingesters in a single zone might not be an issue, if corruption happened at the same time in > 1 zone, then there might be data loss. All 3 zones = most likely a data loss. So it cannot be totally ignored (but maybe it can be a warning if the corruption was in a single zone, and a page if it happened in > 1 zone. As a precaution, though, paging for 1 zone sounds good if it helps prevent corruptions in the other zones).

Maybe we can defer the investigation to normal working hours, but it cannot be ignored.
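
One way to encode this zone-based severity split is a pair of alerting rules that count how many zones have reported a recent WAL corruption. The sketch below is purely illustrative: it assumes the alert's underlying metric is `cortex_ingester_tsdb_wal_corruptions_total` and that a `zone` label is available on it (in practice the zone may have to be derived from the `job` or pod name); neither assumption is confirmed in this thread.

```yaml
# Illustrative sketch only. Assumes a `zone` label on the (assumed)
# metric cortex_ingester_tsdb_wal_corruptions_total; adapt to the
# labels your deployment actually exposes.
groups:
  - name: mimir_wal_corruption_sketch
    rules:
      # Warning: corruption confined to a single zone, typically
      # handled automatically by the affected ingester(s).
      - alert: MimirIngesterTSDBWALCorruptedSingleZone
        expr: |
          count(
            count by (zone) (
              increase(cortex_ingester_tsdb_wal_corruptions_total[1h]) > 0
            )
          ) == 1
        labels:
          severity: warning
      # Critical: corruptions in more than one zone at roughly the
      # same time, which may indicate data loss.
      - alert: MimirIngesterTSDBWALCorruptedMultiZone
        expr: |
          count(
            count by (zone) (
              increase(cortex_ingester_tsdb_wal_corruptions_total[1h]) > 0
            )
          ) > 1
        labels:
          severity: critical
```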

pracucci commented 2 years ago

Set the critical alert to fire on > 2 WAL corruptions in the last hour, assuming the worst case of all 3 corruptions being in different zones

I agree with this, but I think it should be >= 2 WAL corruptions over the last 3h (because the TSDB head can contain samples up to 3h old). I'm saying >= 2 because with RF=3 and quorum=2, if we have 2 corrupted WALs they could both contain the only 2 copies of a given sample.

If the cluster is running with multiple zones, replace ">= 2 WALs" with "ingesters in >= 2 zones".
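
Under the same metric and label assumptions as the sketch above, this refinement for a multi-zone cluster might look roughly like the following: page only when ingesters in >= 2 zones report a corruption within the 3h window the TSDB head can still cover.

```yaml
# Illustrative refinement (same metric/label assumptions as above):
# critical only when ingesters in >= 2 zones hit a WAL corruption
# within the 3h TSDB head window.
- alert: MimirIngesterTSDBWALCorruptedMultiZone
  expr: |
    count(
      count by (zone) (
        increase(cortex_ingester_tsdb_wal_corruptions_total[3h]) > 0
      )
    ) >= 2
  labels:
    severity: critical
```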

aknuds1 commented 2 years ago

Thanks for the helpful input, Marco!