probe-lab / hermes

A Gossipsub listener and tracer.

Broadcasting of invalid `voluntary_exit` messages to mesh peers #24

Open cortze opened 1 month ago

cortze commented 1 month ago

Description

We've seen that, after 2 to 2.5 hours of running, Hermes starts experiencing sudden spikes in GRAFT and PRUNE events across all topics.

Although we couldn't see any direct impact on the number of peers in each mesh, this is a clear concern: it could point to a decreasing peer score, which could prevent us from establishing stable connections with other nodes on the meshes.

Because Hermes does not validate messages on any PubSub topic, our node may be forwarding invalid messages to its mesh peers, which would decrease our score.

We have already seen this on our control Prysm node, where erigon/caplin peers have been sending invalid voluntary_exit messages:

time="2024-05-16 12:35:01" level=debug msg="Gossip message was rejected" agent="erigon/caplin" error="non-active validator cannot exit" gossipScore=-6182.725625534806 multiaddress="/ip4/120.31.71.167/tcp/55742" peerID=16Uiu2HAkzNLy2S3voLw3CFxET1kXYSZVLV6QwkHuP3RaDdGJSk2E prefix=sync topic="/eth2/6a95a1a9/voluntary_exit/ssz_snappy"
time="2024-05-16 12:35:01" level=debug msg="Gossip message was rejected" agent="erigon/caplin" error="non-active validator cannot exit" gossipScore=-6182.725625534806 multiaddress="/ip4/120.31.71.167/tcp/55742" peerID=16Uiu2HAkzNLy2S3voLw3CFxET1kXYSZVLV6QwkHuP3RaDdGJSk2E prefix=sync topic="/eth2/6a95a1a9/voluntary_exit/ssz_snappy"

Possible Solution

I suggest not subscribing to the voluntary_exit topic for now. There is little interest in debugging that particular topic, and the problem seems to be isolated to it.
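A minimal sketch of what skipping that topic could look like, assuming the subscription list is a plain slice of topic strings in the `/eth2/<fork_digest>/<name>/<encoding>` form seen in the logs above. `filterTopics` and the `skip` map are hypothetical names for illustration, not existing Hermes code.

```go
package main

import (
	"fmt"
	"strings"
)

// filterTopics returns the topics whose name segment is not listed in skip.
// It assumes topics look like /eth2/<fork_digest>/<name>/<encoding>;
// anything that does not match that shape is kept untouched.
func filterTopics(topics []string, skip map[string]bool) []string {
	kept := make([]string, 0, len(topics))
	for _, t := range topics {
		// Splitting "/eth2/6a95a1a9/voluntary_exit/ssz_snappy" on "/" yields
		// ["", "eth2", "6a95a1a9", "voluntary_exit", "ssz_snappy"].
		parts := strings.Split(t, "/")
		if len(parts) == 5 && skip[parts[3]] {
			continue
		}
		kept = append(kept, t)
	}
	return kept
}

func main() {
	topics := []string{
		"/eth2/6a95a1a9/beacon_block/ssz_snappy",
		"/eth2/6a95a1a9/voluntary_exit/ssz_snappy",
	}
	skip := map[string]bool{"voluntary_exit": true}
	for _, t := range filterTopics(topics, skip) {
		fmt.Println("subscribing to", t)
	}
}
```

Matching on the name segment rather than the full string keeps the filter valid across fork digests.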

yiannisbot commented 1 month ago

Great catch, which definitely deserves a deeper look! Two quick questions:

guillaumemichel commented 1 month ago

We can unsubscribe from the voluntary_exits for now as a quick fix 👍🏻

In the long run, could we copy the validation logic for this topic over to Hermes as well?

cortze commented 1 month ago

replying to @yiannisbot

why would we see this behaviour only after 2-2.5hrs and not continuously? We suspect receipt of those invalid messages is a random event which happened to start after 2hrs of running our node/experiment?

Voluntary exits are rather infrequent messages, as each one represents a validator announcing its voluntary exit from the set of active validators. Thus, they are pretty sporadic.

given that meshes and peer scores are per topic, why would our node get PRUNE'd from topics other than the voluntary_exit one?

If your score gets too low, it can actually affect other topics as well ->

... The score is computed across all (configured) topics with a weighted mix, such that faulty behaviour in one topic percolates to other topics. ....
...
Heartbeat Maintenance
The score is checked explicitly during heartbeat maintenance such that:
- Peers with a negative score are pruned from all meshes.
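To make the excerpt above concrete, here is a deliberately simplified sketch of how one bad topic can drag the overall score negative. It only models the weighted sum of per-topic scores; the real Gossipsub v1.1 score has more components (caps, decay, IP colocation, behaviour penalties) that are left out. The type and function names are hypothetical and the numbers are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// topicScore is a simplified, hypothetical stand-in for one topic's
// contribution to a peer's score.
type topicScore struct {
	weight float64 // topic weight from the score parameters
	value  float64 // per-topic score (negative when invalid messages were delivered)
}

// peerScore returns the weighted sum over all configured topics.
func peerScore(topics map[string]topicScore) float64 {
	total := 0.0
	for _, t := range topics {
		total += t.weight * t.value
	}
	return total
}

// meshesToPrune mimics the heartbeat check quoted above: a peer whose
// overall score is negative is pruned from every mesh, not only from the
// topic that caused the penalty. Names are sorted for stable output.
func meshesToPrune(topics map[string]topicScore) []string {
	if peerScore(topics) >= 0 {
		return nil
	}
	names := make([]string, 0, len(topics))
	for name := range topics {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	topics := map[string]topicScore{
		"beacon_block":   {weight: 1.0, value: 2.0},
		"voluntary_exit": {weight: 0.5, value: -10.0},
	}
	fmt.Println("score:", peerScore(topics))
	fmt.Println("pruned from:", meshesToPrune(topics))
}
```

With these made-up values the penalty on voluntary_exit outweighs the good behaviour on beacon_block, so the peer would be pruned from both meshes.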

to @guillaumemichel

We can unsubscribe from the voluntary_exits for now as a quick fix

I've applied the quick fix for the overnight run I did locally (fingers crossed). If that improves the mesh connectivity, I'll add a quick patch and think about a more long-term solution.

I thought that we could easily fetch the list of active validators from our trusted Prysm node right at the start of the tool, then judge whether each exit is valid and update that list on the go 🤷🏽
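A rough sketch of that idea, assuming the active validator indices have already been fetched from the Prysm node (the fetch itself is not shown) and that the validator index has already been decoded from the exit message. `activeSet`, `checkExit` and the verdict constants are hypothetical names. This only covers the "non-active validator cannot exit" case from the logs above; a real validator for this topic would also need the signature and epoch checks, and would probably ignore rather than reject a repeated exit.

```go
package main

import (
	"fmt"
	"sync"
)

// verdict is a hypothetical stand-in for a pubsub validation result.
type verdict int

const (
	accept verdict = iota
	reject
)

// activeSet is an in-memory view of the active validator indices,
// seeded once at startup and updated as exits are accepted.
type activeSet struct {
	mu     sync.Mutex
	active map[uint64]struct{}
}

// newActiveSet builds the set from a list of validator indices.
func newActiveSet(indices []uint64) *activeSet {
	s := &activeSet{active: make(map[uint64]struct{}, len(indices))}
	for _, i := range indices {
		s.active[i] = struct{}{}
	}
	return s
}

// checkExit rejects an exit for a validator that is not in the set.
// On accept it removes the validator, so the list is kept up to date
// on the go (a simplification: the validator is dropped immediately).
func (s *activeSet) checkExit(validatorIndex uint64) verdict {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.active[validatorIndex]; !ok {
		return reject
	}
	delete(s.active, validatorIndex)
	return accept
}

func main() {
	set := newActiveSet([]uint64{1, 2, 3})
	for _, idx := range []uint64{2, 2, 99} {
		if set.checkExit(idx) == accept {
			fmt.Println("validator", idx, "exit accepted")
		} else {
			fmt.Println("validator", idx, "exit rejected")
		}
	}
}
```

The mutex is there because topic validators may be invoked concurrently.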