Open aknuds1 opened 2 years ago
Set the critical alert to > 2 WAL corruptions in the last hour, assuming the worst case of all 3 corruptions being in a different zone
I agree on this, but I think it should be >= 2 WAL corruptions over the last 3h (because TSDB head could contain samples up to 3h old). I'm saying >= 2 because with RF=3 and quorum=2, if we have 2 corrupted WALs they could both contain the only 2 copies of a given sample.
If the cluster is running with multiple zones, replace ">= 2 WALs" with "ingesters in >= 2 zones".
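The quorum argument above can be checked with a small enumeration. This is only a sketch under a pessimistic assumption that is not stated in the thread: a sample is treated as lost on any replica whose WAL is corrupted, and a write is assumed to have reached exactly `quorum` replicas. The function name `can_lose_sample` and its parameters are hypothetical, not part of Mimir.

```python
from itertools import combinations


def can_lose_sample(replication_factor: int, quorum: int, corrupted: int) -> bool:
    """Return True if some acknowledged sample could have all of its
    copies on replicas with a corrupted WAL (worst-case model)."""
    replicas = range(replication_factor)
    # Worst case: the write was acknowledged by exactly `quorum` replicas.
    for holders in combinations(replicas, quorum):
        # Try every possible set of `corrupted` replicas.
        for bad in combinations(replicas, corrupted):
            if set(holders) <= set(bad):
                return True
    return False


# RF=3, quorum=2: one corrupted WAL always leaves at least one good copy,
# while two corrupted WALs may cover both copies of a sample.
print(can_lose_sample(3, 2, 1))
print(can_lose_sample(3, 2, 2))
```

For a multi-zone cluster the same model applies if each "replica" is read as a zone, which is why the threshold becomes "ingesters in >= 2 zones".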
Thanks for the helpful input, Marco!
Is your feature request related to a problem? Please describe.
The MimirIngesterTSDBWALCorrupted alert causes pages because it has critical severity, even when the affected ingester(s) handle the corruption automatically and there is nothing for the paged engineer to do but investigate. The investigation is also usually mundane, as the typical cause is a Kubernetes pod having been terminated abruptly due to rescheduling.
Describe the solution you'd like
We should consider whether it's possible to distinguish between critical and non-critical cases of MimirIngesterTSDBWALCorrupted, so engineers only get paged about critical cases (which are hopefully rare and actually in need of human intervention).
Describe alternatives you've considered
@codesome has provided the following ideas:
Additional context
This alert seems to fire quite often - I was paged for it three times last week, and all cases were just one ingester and handled automatically.
Some context from @codesome on the MimirIngesterTSDBWALCorrupted alert: