Open catwith1hat opened 1 month ago
Could you please provide the execution arguments of the `nimbus_beacon_node` binary?
```shell
/home/user/nimbus_beacon_node --network=holesky --jwt-secret=/jwt.hex \
  --udp-port=${PORT} --tcp-port=${PORT} --data-dir=/nimbus-data \
  --el=http://localhost:8640 --enr-auto-update --metrics --metrics-port=11140 \
  --metrics-address=0.0.0.0 --rest --rest-port=4040 --rest-address=0.0.0.0 \
  --suggested-fee-recipient=${ADDR} --doppelganger-detection=off \
  --history=prune --web3-signer-update-interval=300 \
  --in-process-validators=false --payload-builder=true \
  --payload-builder-url=http://localhost:18920
```
Could you please confirm that you are using the latest release version of nimbus-eth2 (24.5.1) and that you have not specified the `--listen-address` CLI option?
@catwith1hat Do you also have the full log line for `WRN 2024-05-29 18:00:03.718+00:00 Peer count low, no new peers discovered`? Normally this log line should also print `discovered_nodes`, `new_peers`, `current_peers`, and `wanted_peers`.
@cheatfate I can confirm that I haven't set the `--listen-address` CLI option:

```shell
$ ps axu | grep beacon | grep listen-add | wc
      0       0       0
```
@kdeme: Sorry, I truncated the line while copying it. Here is the full line:

```
WRN 2024-05-29 18:00:03.718+00:00 Peer count low, no new peers discovered topics="networking" discovered_nodes=0 new_peers=@[] current_peers=0 wanted_peers=160
```
> Sorry, I truncated the line while copying it. Here is the full line:
Thank you, that's very useful.
In terms of reproducing this: Do I understand correctly that you are running a Nimbus Docker container inside a QEMU VM?
edit: As I cannot reproduce this myself, it would be really good to know the exact setup, as I think this will be some setup-specific issue. I also think that the `Discovery send failed msg="(101) Network is unreachable"` log is not something that would show up when run from a Docker container, at least not in network-bridge mode.
Additional question: did all the `Peer count low, no new peers discovered` lines give `discovered_nodes` set to 0?
> In terms of reproducing this: Do I understand correctly that you are running a Nimbus Docker container inside a QEMU VM?
That's correct.
> Additional question, did all the `Peer count low, no new peers discovered` lines give `discovered_nodes` set to 0?
Pretty much:
```shell
$ journalctl -u podman-nimbus-N4-I0.service --since="2024-05-29 16:00:00" --until="2024-05-29 20:00:00" | \
  grep -oE "Peer count low.*" | \
  awk '
  $0 != prev {
      if (count > 1) {
          print count "x" prev
      }
      prev = $0
      count = 1
  }
  $0 == prev {
      count++
  }
  END {
      if (count > 1) {
          print count "x" prev
      }
  }'
2xPeer count low, no new peers discovered topics="networking" discovered_nodes=0 new_peers=@[] current_peers=36 wanted_peers=160
2xPeer count low, no new peers discovered topics="networking" discovered_nodes=0 new_peers=@[] current_peers=35 wanted_peers=160
2xPeer count low, no new peers discovered topics="networking" discovered_nodes=1 new_peers=@[] current_peers=0 wanted_peers=160
5xPeer count low, no new peers discovered topics="networking" discovered_nodes=0 new_peers=@[] current_peers=0 wanted_peers=160
4xPeer count low, no new peers discovered topics="networking" discovered_nodes=1 new_peers=@[] current_peers=0 wanted_peers=160
3xPeer count low, no new peers discovered topics="networking" discovered_nodes=0 new_peers=@[] current_peers=0 wanted_peers=160
4xPeer count low, no new peers discovered topics="networking" discovered_nodes=1 new_peers=@[] current_peers=0 wanted_peers=160
5xPeer count low, no new peers discovered topics="networking" discovered_nodes=0 new_peers=@[] current_peers=0 wanted_peers=160
2xPeer count low, no new peers discovered topics="networking" discovered_nodes=1 new_peers=@[] current_peers=0 wanted_peers=160
29xPeer count low, no new peers discovered topics="networking" discovered_nodes=0 new_peers=@[] current_peers=0 wanted_peers=160
```
(the awk script counts how many times each line repeats consecutively; runs that occur only once are not printed)
> As I cannot reproduce this myself, it would be really good to know the exact setup, as I think this will be some setup specific issue. I also think that the `Discovery send failed msg="(101) Network is unreachable"` log is not something that would show up when run from a Docker container, well, at least not in network-bridge mode.
You are probably correct that you would not get `Network is unreachable` in that case. It is true that my networking inside the Docker container is non-standard: I use my personal equivalent of gluetun, which sets up VPN networking inside a container, and the Nimbus container attaches to that container's network. When I cut the link, the default route for Nimbus disappears after openvpn dies. However, the networking comes back, as does a new default route. So I do believe that getting stuck is still undesirable behavior on Nimbus's part.

But spotting `Network is unreachable` might be a great catch. Hypothesis: maybe Nimbus reacts differently to a socket error that returns "Network unreachable" than to a connection that simply times out? Maybe it treats the former as a more permanent error when connecting to a peer, such that the peer is never tried again in the future?
@kdeme If you still have your setup at hand, would you mind trying to reproduce this by removing the default route of the Docker container (or the whole VM/host) for, say, 60 minutes?
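A minimal sketch of that repro step, assuming root access and iproute2's `ip` in whatever network namespace Nimbus uses (the function name and the 3600-second window are mine, not from the thread):

```shell
#!/usr/bin/env bash
# Hypothetical helper: drop the default route, wait, then restore it,
# so connect attempts fail with "Network is unreachable" in the meantime.

simulate_outage() {
    local seconds="${1:-3600}"
    local saved
    # Remember the current default route verbatim,
    # e.g. "default via 192.168.1.1 dev eth0"
    saved=$(ip route show default)
    ip route del default    # connectivity gone: outbound connects now fail
    sleep "$seconds"        # keep the node offline for the whole window
    # word-splitting of $saved is intentional here
    ip route add $saved     # bring the default route back
}

# Usage (as root, while watching the Nimbus logs):
#   simulate_outage 3600
```

After the route comes back, the interesting question is whether `discovered_nodes` ever climbs above 0 again without a restart.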
Two more datapoints:

```shell
podman exec -it nimbus /bin/bash
```

and then using bash's built-in TCP support to connect to some random website.
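For reference, the "built-in TCP support" trick looks roughly like this (a sketch; the probe function and the target are my own illustration, not from the thread):

```shell
# Hypothetical connectivity probe usable inside a minimal container image
# (no curl/nc needed), relying on bash's /dev/tcp pseudo-device.

check_tcp() {
    local host="$1" port="$2"
    # Redirecting through /dev/tcp/<host>/<port> makes bash open a TCP
    # connection; doing it in a subshell closes the socket on exit.
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
        echo "reachable"
    else
        echo "unreachable"
    fi
}

# Usage inside the container:
#   check_tcp example.com 80
```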
Describe the bug
When a node loses connectivity for an extended period of time, it eventually exhausts all peers it tries to connect to. After it has unsuccessfully tried each peer, and after connectivity is restored, the node does not heal. It tries to discover new peers, but can't find any. I straced the beacon_node binary to see what's going on, and it seems that logging to syslog is the only activity of the node. The node quickly recovers after a restart.
To Reproduce
Cut connectivity on a Holesky node for about 40 minutes until you see the "Peer count low" warning. Restore the connection. Observe that the node does not heal. If you are running the node in a libvirt/qemu VM, you can easily toggle link connectivity in the virt-manager interface.
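If you prefer the command line to the virt-manager UI, libvirt's `virsh domif-setlink` drives the same link state. A sketch (the domain name and MAC below are placeholders, and `VIRSH` is overridable for dry runs):

```shell
# Hypothetical wrappers around `virsh domif-setlink` to cut and restore
# a guest's virtual NIC link, mirroring the virt-manager checkbox.

VIRSH="${VIRSH:-virsh}"

cut_link()     { "$VIRSH" domif-setlink "$1" "$2" down; }   # <domain> <mac>
restore_link() { "$VIRSH" domif-setlink "$1" "$2" up; }     # <domain> <mac>

# Usage:
#   virsh domiflist holesky-vm            # look up the vNIC's MAC address
#   cut_link holesky-vm 52:54:00:aa:bb:cc
#   sleep 2400                            # ~40 minutes of outage
#   restore_link holesky-vm 52:54:00:aa:bb:cc
```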
Screenshots
Our test starts off with a working node with 158 peers.
I cut the connection around 17:22.
Around 18:00, I restore the connection. The `Discovery send failed` messages stop, but no new peers are discovered.
From that I assume that some kind of Discovery is being done by Nimbus. But it probably only finds peers that it already knew about and that it marked as bad during the period of lost connectivity. Two hours later, the node is still stuck at this:
I restart the node and things go back to normal quickly:
As you see above, the node immediately picks up 22 peers at 19:53:49.
Additional context
Nimbus 24.5.1 from the official Docker image.