Open cortze opened 1 month ago
Great catch, this definitely deserves a deeper look! Two quick questions:
... `voluntary_exits` one?
We can unsubscribe from `voluntary_exits` for now as a quick fix 👍🏻
In the long run, could we copy the validation logic for this topic over to Hermes as well?
replying to @yiannisbot
why would we see this behaviour only after 2-2.5hrs and not continuously? We suspect receipt of those invalid messages is a random event which happened to start after 2hrs of running our node/experiment?
Voluntary exits are rather infrequent messages, as each one represents a validator voluntarily exiting the list of active validators. Thus, they are pretty sporadic.
given that meshes and peer scores are per topic, why would our node get PRUNE'd from topics other than the `voluntary_exits` one?
If your score gets too low, it can actually affect other topics as well ->
... The score is computed across all (configured) topics with a weighted mix, such that faulty behaviour in one topic percolates to other topics. ....
...
Heartbeat Maintenance
The score is checked explicitly during heartbeat maintenance such that:
- Peers with a negative score are pruned from all meshes.
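
To make the quoted rule concrete, here is a minimal Go sketch of how a penalty on one topic can drag the whole score below zero. It is not the go-libp2p-pubsub implementation: it assumes a heavily simplified score that keeps only two per-topic components (first message deliveries and squared invalid deliveries) and ignores caps, decay, and the global components. All type names, weights, and counts are hypothetical.

```go
package main

import "fmt"

// topicParams holds simplified, hypothetical per-topic scoring parameters.
type topicParams struct {
	weight              float64 // weight of this topic in the global mix
	firstDeliveryWeight float64 // reward per first delivery (positive)
	invalidWeight       float64 // penalty per squared invalid delivery (negative)
}

// topicStats is what a remote peer has observed from us on one topic.
type topicStats struct {
	firstDeliveries   float64
	invalidDeliveries float64
}

// peerScore mixes the per-topic scores into one number using the topic weights.
func peerScore(params map[string]topicParams, stats map[string]topicStats) float64 {
	var score float64
	for topic, st := range stats {
		p, ok := params[topic]
		if !ok {
			continue // topics without scoring parameters contribute nothing
		}
		topicScore := p.firstDeliveryWeight*st.firstDeliveries +
			p.invalidWeight*st.invalidDeliveries*st.invalidDeliveries
		score += p.weight * topicScore
	}
	return score
}

// meshesToPrune mimics the heartbeat check: a negative score means
// the peer is pruned from every mesh, not only the offending topic.
func meshesToPrune(score float64, meshes []string) []string {
	if score >= 0 {
		return nil
	}
	return meshes
}

func main() {
	params := map[string]topicParams{
		"beacon_block":    {weight: 0.5, firstDeliveryWeight: 1, invalidWeight: -10},
		"voluntary_exits": {weight: 0.25, firstDeliveryWeight: 1, invalidWeight: -100},
	}
	stats := map[string]topicStats{
		"beacon_block":    {firstDeliveries: 10},  // well behaved here
		"voluntary_exits": {invalidDeliveries: 5}, // forwarded invalid exits
	}
	score := peerScore(params, stats)
	fmt.Println("score:", score)
	fmt.Println("pruned from:", meshesToPrune(score, []string{"beacon_block", "voluntary_exits"}))
}
```

Because the invalid-delivery term grows quadratically in this sketch, a handful of bad messages on a low-traffic topic can outweigh good behaviour everywhere else.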
replying to @guillaumemichel
We can unsubscribe from the voluntary_exits for now as a quick fix
I've applied the quick-fix for the night run I did locally. (fingers-crossed) If that improves the mesh connectivity, I'll add a quick patch and think of a more long-term solution.
I thought we could fetch the list of active validators from our trusted Prysm node right at the start of the tool, then judge whether each incoming exit is valid and update that list on the go 🤷🏽
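
A rough Go sketch of that idea follows. It only covers the membership check described above: the tracker is seeded with the active validator indices and accepts an exit once per active validator. Fetching the list from Prysm is left out, and so are the signature and epoch checks a full validation would need. `ExitTracker`, `NewExitTracker`, and `ProcessExit` are hypothetical names, not existing Hermes or Prysm APIs.

```go
package main

import (
	"fmt"
	"sync"
)

// ValidatorIndex identifies a validator in the beacon state.
type ValidatorIndex uint64

// ExitTracker keeps the set of validators that are still active.
// Hypothetical helper: the initial set is assumed to come from a trusted node.
type ExitTracker struct {
	mu     sync.Mutex
	active map[ValidatorIndex]struct{}
}

// NewExitTracker seeds the tracker with the active validator indices.
func NewExitTracker(active []ValidatorIndex) *ExitTracker {
	set := make(map[ValidatorIndex]struct{}, len(active))
	for _, idx := range active {
		set[idx] = struct{}{}
	}
	return &ExitTracker{active: set}
}

// ProcessExit reports whether an exit for idx is acceptable.
// An accepted exit removes the validator, so a repeated exit is rejected.
func (t *ExitTracker) ProcessExit(idx ValidatorIndex) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.active[idx]; !ok {
		return false // unknown or already exited validator
	}
	delete(t.active, idx)
	return true
}

func main() {
	tracker := NewExitTracker([]ValidatorIndex{1, 2, 3})
	fmt.Println(tracker.ProcessExit(2))  // first exit of an active validator
	fmt.Println(tracker.ProcessExit(2))  // duplicate exit
	fmt.Println(tracker.ProcessExit(99)) // validator not in the active set
}
```

The mutex is there because gossip validation callbacks may run concurrently; the boolean result could be mapped to an accept or reject decision in whatever validator hook the tool ends up using.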
Description
We've seen that, after 2 to 2.5 hours of running, Hermes starts experiencing sudden spikes in GRAFT and PRUNE events affecting all the topics.
Although we couldn't see any direct impact on the number of peers in each mesh, it is a clear concern: it could point to a decreasing peer score, which could prevent us from establishing stable connections with other nodes on the meshes.
Due to the lack of message validation on each PubSub topic, it is possible that our node is forwarding invalid messages to our mesh peers, decreasing our score.
This has already been observed on our control Prysm node, where `erigon/caplin` peers have been sending invalid `voluntary_exits`.
Possible Solution
I suggest not subscribing to the `voluntary_exits` topic for now. Interest in debugging that particular topic is rather low, and the problem seems to be isolated to that topic.
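
As an illustration of the quick fix, the Go sketch below drops a topic from the subscription list before subscribing. It assumes topic strings of the form `/eth2/<fork_digest>/<name>/<encoding>`; the fork digest and topic strings in `main` are placeholders, and `topicName` and `filterTopics` are hypothetical helpers rather than existing Hermes functions. The skipped name must match the exact name segment the node really subscribes to.

```go
package main

import (
	"fmt"
	"strings"
)

// topicName extracts the <name> segment from a topic string of the
// assumed form /eth2/<fork_digest>/<name>/<encoding>.
// It returns "" when the topic does not have exactly four segments.
func topicName(topic string) string {
	parts := strings.Split(strings.Trim(topic, "/"), "/")
	if len(parts) != 4 {
		return ""
	}
	return parts[2]
}

// filterTopics returns the topics whose name is not in skip.
func filterTopics(topics []string, skip map[string]bool) []string {
	kept := make([]string, 0, len(topics))
	for _, t := range topics {
		if skip[topicName(t)] {
			continue // leave this topic out of the subscription list
		}
		kept = append(kept, t)
	}
	return kept
}

func main() {
	// Placeholder topic strings for illustration only.
	topics := []string{
		"/eth2/abcdef01/beacon_block/ssz_snappy",
		"/eth2/abcdef01/voluntary_exits/ssz_snappy",
	}
	skip := map[string]bool{"voluntary_exits": true}
	for _, t := range filterTopics(topics, skip) {
		fmt.Println("subscribing to", t)
	}
}
```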