status-im / infra-nimbus

Infrastructure for Nimbus cluster
https://nimbus.team

Adding monitoring on slashing metrics #166

Open apentori opened 5 months ago

apentori commented 5 months ago

What

Add monitoring and alerts for validator slashing metrics.

Why

This issue follows the incident of 2024/01/31 on Holesky, where validators were slashed due to the BN and VC using the same validator keys.

apentori commented 5 months ago

Monitoring already available: https://github.com/status-im/infra-hq/blob/3b0e8240fb83fd84d1a19f43f2d0de45b0ded316/ansible/roles/prometheus-slave/files/rules/nimbus.yml#L6.

I couldn't find in the VictorOps history whether the alert was correctly created.

jakubgs commented 5 months ago

What about metrics? Has the count risen during the Holesky fleet slashing?

apentori commented 5 months ago

Yes, the metric did change over time, for example on geth-07 between 30/01 and 02/02:

(screenshot: validator_monitor_slashed on geth-07)

We could simplify the alert condition to:

validator_monitor_slashed > 0

jakubgs commented 5 months ago

Okay, I see how this works, there's a `validator="total"` label which shows the whole count for the node:

validator_monitor_slashed{fleet="nimbus.prater", validator="total"}


And the rest are per-validator series, labeled with the beginning of the public key. But we don't need to look at those, it's easier to look at the total.

The issue with validator_monitor_slashed > 0 is that it will trigger for those hosts that already have slashed validators:

(screenshot)

Which means that alone is not useful. We need to detect an increase in that metric, not just a value above zero.

jakubgs commented 5 months ago

For example, here we can see that the count just goes up:

validator_monitor_slashed{
    instance="geth-09.ih-eu-mda1.nimbus.holesky", fleet="nimbus.holesky", validator="total"
}


But here we can see it's just a spike when we use delta() or increase():

increase(validator_monitor_slashed{
    instance="geth-09.ih-eu-mda1.nimbus.holesky", fleet="nimbus.holesky", validator="total"
}[1m])


If we increase the time range to 15 minutes we can see more spikes:

(screenshot)

But it doesn't last, so the alert clears quickly.

jakubgs commented 5 months ago

What we use right now is:

sum by (instance,container)(delta(validator_monitor_slashed[15m])) > 0

There are at least two ways we can improve this:

sum by (instance,container)(increase(validator_monitor_slashed{validator="total"}[15m])) > 0

By using increase() and filtering on validator="total". But that doesn't really solve the issue of the alert clearing after 15 minutes.
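
For illustration, here is roughly how that improved expression could slot into a Prometheus rule file. This is only a sketch: the group, alert name, labels, and annotations are assumptions, not the actual infra-hq rule; only the expr matches the query above.

groups:
  - name: nimbus
    rules:
      # Sketch only: everything except the expr is assumed, not taken from infra-hq.
      - alert: NimbusValidatorSlashed
        expr: sum by (instance,container)(increase(validator_monitor_slashed{validator="total"}[15m])) > 0
        labels:
          severity: critical
        annotations:
          summary: 'validator_monitor_slashed increased on {{ $labels.instance }}/{{ $labels.container }}'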

Prometheus sends alerts every time it evaluates the rule and the condition is true, but once the condition clears the alert disappears:

Prometheus's alerting rules are good at figuring what is broken right now, but they are not a fully-fledged notification solution. Another layer is needed to add summarization, notification rate limiting, silencing and alert dependencies on top of the simple alert definitions.

https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/#sending-alert-notifications

jakubgs commented 5 months ago

I see no clear way to make such an alert last using Alertmanager configuration: https://prometheus.io/docs/alerting/latest/configuration/

So it seems the only way to make it last longer is to do some metric query wizardry: https://prometheus.io/docs/prometheus/latest/querying/functions/
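
One candidate piece of wizardry (just a sketch; the 6h window and 5m resolution are arbitrary values picked for illustration) would be to wrap the increase in a subquery and take max_over_time() over it, so the result stays above zero long after the last increase:

max_over_time(
  increase(validator_monitor_slashed{validator="total"}[15m])[6h:5m]
) > 0

That would keep the alert firing for up to 6 hours after a slashing event, at the cost of a heavier query and an equally arbitrary cut-off.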

jakubgs commented 5 months ago

There is also another way, which is not great: we could ask for another metric:

# HELP validator_monitor_slashed_since_start Incremented by 1 for each slashed validator during this run.
# TYPE validator_monitor_slashed_since_start gauge
validator_monitor_slashed_since_start 0.0
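
With a metric like that, the alert could go back to a simple threshold, since the value only resets when the node restarts (a sketch assuming the metric name proposed above):

sum by (instance,container)(validator_monitor_slashed_since_start) > 0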

Not sure if this would be accepted by nimbus-eth2 developers, but the only other alternatives are:

  1. Dark magic with a Prometheus query to keep the value above zero for longer.
    • Not sure if that's doable at all, and it would probably be a horrible hack.
  2. Removing slashed validator keys from affected nodes.
    • Jacek expressed approval for keeping nodes with slashed validators running, to test that part of the code.

jakubgs commented 3 weeks ago

I guess we never came to a conclusion on how to fix this. Seems like a new metric would make this much easier.