apentori opened 5 months ago
Monitoring already available: https://github.com/status-im/infra-hq/blob/3b0e8240fb83fd84d1a19f43f2d0de45b0ded316/ansible/roles/prometheus-slave/files/rules/nimbus.yml#L6.
I couldn't find in the VictorOps history whether the alert was correctly created.
What about metrics? Has the count risen during the Holesky fleet slashing?
Yes, the metric changed over time, for example on geth-07 between 30/01 and 02/02:
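For reference, a range query along these lines would show that change (the full instance label is a guess, patterned after the geth-09 series further down):

```promql
validator_monitor_slashed{instance="geth-07.ih-eu-mda1.nimbus.holesky"}
```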
We could simplify the alert condition to:
validator_monitor_slashed > 0
Okay, I see how this works, there's a `validator="total"` label which shows the whole count for the node:
validator_monitor_slashed{fleet="nimbus.prater", validator="total"}
The rest are per-validator series, keyed by the beginning of the public key. But we don't need to look at those; it's easier to look at the total.
The issue with `validator_monitor_slashed > 0` is that it will trigger for hosts that already have slashed validators:
Which means that alone is not useful. We need to detect an increase in the metric, not just a value above zero.
For example, here we can see that the count just goes up:
validator_monitor_slashed{
instance="geth-09.ih-eu-mda1.nimbus.holesky", fleet="nimbus.holesky", validator="total"
}
But here we can see it's just a spike when we use `delta()` or `increase()`:
increase(validator_monitor_slashed{
instance="geth-09.ih-eu-mda1.nimbus.holesky", fleet="nimbus.holesky", validator="total"
}[1m])
If we increase the time range to 15 minutes we can see more spikes:
But it doesn't last, so the alert clears quickly.
What we use right now is:
sum by (instance,container)(delta(validator_monitor_slashed[15m])) > 0
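For context, here is roughly how that expression sits inside a Prometheus alerting rule (a sketch only; the alert name, severity and annotation are illustrative, not the actual contents of nimbus.yml):

```yaml
groups:
  - name: nimbus
    rules:
      - alert: NimbusValidatorSlashed
        # Fires while the slashed-validator count has grown within the last 15 minutes.
        expr: sum by (instance,container)(delta(validator_monitor_slashed[15m])) > 0
        labels:
          severity: critical
        annotations:
          summary: 'Slashed validator count increased on {{ $labels.instance }}'
```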
There are at least two ways we can improve this:
sum by (instance,container)(increase(validator_monitor_slashed{validator="total"}[15m])) > 0
By using `increase()` and adding `validator="total"`. But that doesn't really solve the issue of the alert clearing after 15 minutes.
Prometheus sends an alert every time it evaluates the check and the condition is true, but once it clears, the alert disappears:
> Prometheus's alerting rules are good at figuring what is broken right now, but they are not a fully-fledged notification solution. Another layer is needed to add summarization, notification rate limiting, silencing and alert dependencies on top of the simple alert definitions.
I see no clear way to make such an alert last using Alertmanager configuration: https://prometheus.io/docs/alerting/latest/configuration/
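For reference, the knobs Alertmanager does give us (sketch below, values illustrative) only control how notifications for a firing alert are grouped and repeated; none of them keep an alert around once it has resolved:

```yaml
route:
  receiver: victorops
  group_by: [alertname, instance]
  group_wait: 30s        # delay before the first notification for a new group
  group_interval: 5m     # delay before notifying about new alerts added to a group
  repeat_interval: 4h    # re-notify while the alert is still firing, not after it resolves
```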
So it seems the only way to make it last longer is doing some metric query wizardry: https://prometheus.io/docs/prometheus/latest/querying/functions/
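One option in that direction (an untested sketch) is to wrap the increase in a subquery and take `max_over_time` over a longer window, so a single spike keeps the expression above zero for hours:

```promql
# Stays above zero for 6 hours after any 15-minute window in which the count increased.
sum by (instance,container)(
  max_over_time(
    increase(validator_monitor_slashed{validator="total"}[15m])[6h:5m]
  )
) > 0
```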
Also, there is another way, which is not great, but we could ask for another metric:
# HELP validator_monitor_slashed_since_start Incremented by 1 for each slashed validator during this run.
# TYPE validator_monitor_slashed_since_start gauge
validator_monitor_slashed_since_start 0.0
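If such a metric existed, the alert condition would reduce to something like this (a sketch; the metric is only a proposal and the alert name is made up):

```yaml
- alert: ValidatorSlashedSinceStart
  # Any non-zero value means a validator was slashed during this run, so the alert
  # keeps firing until the node is restarted (or the count goes back to zero).
  expr: validator_monitor_slashed_since_start > 0
  labels:
    severity: critical
  annotations:
    summary: 'Validator slashed on {{ $labels.instance }} since last restart'
```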
Not sure if this would be accepted by `nimbus-eth2` developers, but the only other alternatives are:
I guess we never came to a conclusion on how to fix this. Seems like a new metric would make this much easier.
What
`validator_monitor_slashed`
Why
The issue follows the incident of 2024/01/31 on Holesky where validators were slashed because the BN and VC were using the same validator keys.