status-im / nimbus-eth2

Nim implementation of the Ethereum Beacon Chain
https://nimbus.guide

Beacon node's P2P degrades permanently after 40 minutes of no connectivity #6323

Open catwith1hat opened 1 month ago

catwith1hat commented 1 month ago

Describe the bug When a node loses connectivity for an extended period of time, it eventually exhausts its list of known peers. After it has unsuccessfully tried each peer, and after connectivity is restored, the node does not heal: it tries to discover new peers but can't find any. I straced the beacon_node binary to see what was going on, and logging to syslog appeared to be the node's only activity. The node recovers quickly after a restart.
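
A sketch of such an strace attach (hypothetical invocation, not the exact one used; assumes the process is named beacon_node, as in the official image):

$ sudo strace -f -p "$(pidof beacon_node)" -e trace=network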

To Reproduce Cut connectivity on a Holesky node for about 40 minutes, until you see the "Peer count low" warning. Restore the connection. Observe that the node does not heal. If the node runs in a libvirt/qemu VM, you can easily toggle the link connectivity in the virt-manager interface (a scriptable equivalent is sketched below).
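
A minimal scriptable equivalent of the virt-manager toggle, assuming a hypothetical libvirt domain name (holesky-node); check virsh domiflist for the actual interface device:

$ virsh domiflist holesky-node                 # find the interface device, e.g. vnet0
$ virsh domif-setlink holesky-node vnet0 down  # cut the link
$ sleep 2400                                   # ~40 minutes
$ virsh domif-setlink holesky-node vnet0 up    # restore connectivity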

Screenshots

Our test starts off with a working node with 158 peers.

INF 2024-05-29 17:21:10.200+00:00 Slot end                                   topics="beacnde" slot=1758405 nextActionWait=5m1s799ms425us813ns nextAttestationSlot=1758431 nextProposalSlot=-1 syncC>
INF 2024-05-29 17:21:12.000+00:00 Slot start                                 topics="beacnde" head=f404d42b:1758405 delay=755us539ns finalized=54948:53c472bd peers=158 slot=1758406 sync=synced ep>

I cut connection around 17:22

INF 2024-05-29 17:22:00.000+00:00 Slot start                                 topics="beacnde" head=7a2eb6dc:1758406 delay=390us983ns finalized=54948:53c472bd peers=159 slot=1758410 sync=synced ep>
INF 2024-05-29 17:22:10.188+00:00 Slot end                                   topics="beacnde" slot=1758410 nextActionWait=4m1s811ms580us448ns nextAttestationSlot=1758431 nextProposalSlot=-1 syncC>
INF 2024-05-29 17:22:12.000+00:00 Slot start                                 topics="beacnde" head=7a2eb6dc:1758406 delay=666us787ns finalized=54948:53c472bd peers=159 slot=1758411 sync=synced ep>
WRN 2024-05-29 17:22:17.692+00:00 Discovery send failed                      topics="eth p2p discv5" msg="(101) Network is unreachable" address=81.99.150.0:9000
INF 2024-05-29 17:22:22.224+00:00 Slot end                                   topics="beacnde" slot=1758411 nextActionWait=3m49s775ms906us480ns nextAttestationSlot=1758431 nextProposalSlot=-1 sync>
WRN 2024-05-29 17:22:23.598+00:00 Discovery send failed                      topics="eth p2p discv5" msg="(101) Network is unreachable" address=178.74.194.251:51394
INF 2024-05-29 17:22:24.000+00:00 Slot start                                 topics="beacnde" head=7a2eb6dc:1758406 delay=845us398ns finalized=54948:53c472bd peers=159 slot=1758412 sync=synced ep>
WRN 2024-05-29 17:22:28.543+00:00 Discovery send failed                      topics="eth p2p discv5" msg="(101) Network is unreachable" address=24.109.179.18:9005
INF 2024-05-29 17:22:34.168+00:00 Slot end                                   topics="beacnde" slot=1758412 nextActionWait=3m37s831ms51us907ns nextAttestationSlot=1758431 nextProposalSlot=-1 syncC>
WRN 2024-05-29 17:22:34.226+00:00 Discovery send failed                      topics="eth p2p discv5" msg="(101) Network is unreachable" address=84.247.133.155:9222
INF 2024-05-29 17:22:36.001+00:00 Slot start                                 topics="beacnde" head=7a2eb6dc:1758406 delay=1ms139us790ns finalized=54948:53c472bd peers=159 slot=1758413 sync=synced>
WRN 2024-05-29 17:22:36.526+00:00 Discovery send failed                      topics="eth p2p discv5" msg="(101) Network is unreachable" address=162.253.182.30:12103
WRN 2024-05-29 17:22:41.302+00:00 Discovery send failed                      topics="eth p2p discv5" msg="(101) Network is unreachable" address=35.200.101.160:9000
INF 2024-05-29 17:22:46.195+00:00 Slot end                                   topics="beacnde" slot=1758413 nextActionWait=3m25s804ms403us720ns nextAttestationSlot=1758431 nextProposalSlot=-1 sync>
INF 2024-05-29 17:22:48.000+00:00 Slot start                                 topics="beacnde" head=7a2eb6dc:1758406 delay=263us215ns finalized=54948:53c472bd peers=159 slot=1758414 sync=synced ep>
WRN 2024-05-29 17:22:50.438+00:00 Discovery send failed                      topics="eth p2p discv5" msg="(101) Network is unreachable" address=116.202.81.184:12000
WRN 2024-05-29 17:22:54.112+00:00 Discovery send failed                      topics="eth p2p discv5" msg="(101) Network is unreachable" address=65.109.111.121:15101
WRN 2024-05-29 17:22:56.592+00:00 Discovery send failed                      topics="eth p2p discv5" msg="(101) Network is unreachable" address=46.37.60.78:43304
INF 2024-05-29 17:22:58.194+00:00 Slot end                                   topics="beacnde" slot=1758414 nextActionWait=3m13s805ms712us641ns nextAttestationSlot=1758431 nextProposalSlot=-1 sync>
INF 2024-05-29 17:23:00.000+00:00 Slot start                                 topics="beacnde" head=7a2eb6dc:1758406 delay=453us654ns finalized=54948:53c472bd peers=159 slot=1758415 sync=synced ep>
WRN 2024-05-29 17:23:00.673+00:00 Discovery send failed                      topics="eth p2p discv5" msg="(101) Network is unreachable" address=178.63.53.180:9000
WRN 2024-05-29 17:23:08.540+00:00 Discovery send failed                      topics="eth p2p discv5" msg="(101) Network is unreachable" address=185.209.178.219:4000
INF 2024-05-29 17:23:10.191+00:00 Slot end                                   topics="beacnde" slot=1758415 nextActionWait=3m1s808ms25us236ns nextAttestationSlot=1758431 nextProposalSlot=-1 syncCo>
INF 2024-05-29 17:23:12.000+00:00 Slot start                                 topics="beacnde" head=7a2eb6dc:1758406 delay=173us809ns finalized=54948:53c472bd peers=159 slot=1758416 sync=synced ep>
WRN 2024-05-29 17:23:14.209+00:00 Discovery send failed                      topics="eth p2p discv5" msg="(101) Network is unreachable" address=93.45.39.79:51678
WRN 2024-05-29 17:23:17.068+00:00 Discovery send failed                      topics="eth p2p discv5" msg="(101) Network is unreachable" address=34.79.11.83:12000

Around 18:00, I restore the connection. The "Discovery send failed" warnings stop, but no new peers are discovered.

WRN 2024-05-29 17:59:59.743+00:00 Discovery send failed                    >
INF 2024-05-29 18:00:00.000+00:00 Slot start                               >
WRN 2024-05-29 18:00:00.096+00:00 Discovery send failed                    >
WRN 2024-05-29 18:00:03.718+00:00 Peer count low, no new peers discovered  >
INF 2024-05-29 18:00:10.171+00:00 Slot end                                 >
INF 2024-05-29 18:00:12.001+00:00 Slot start                               >
WRN 2024-05-29 18:00:16.720+00:00 Peer count low, no new peers discovered  >
INF 2024-05-29 18:00:22.180+00:00 Slot end                                 >
INF 2024-05-29 18:00:24.000+00:00 Slot start                               >
WRN 2024-05-29 18:00:29.724+00:00 Peer count low, no new peers discovered  >
INF 2024-05-29 18:00:34.168+00:00 Slot end                                 >
INF 2024-05-29 18:00:36.001+00:00 Slot start                               >
WRN 2024-05-29 18:00:42.729+00:00 Peer count low, no new peers discovered  >
INF 2024-05-29 18:00:46.150+00:00 Slot end                                 >
INF 2024-05-29 18:00:48.000+00:00 Slot start                               >
WRN 2024-05-29 18:00:55.732+00:00 Peer count low, no new peers discovered  >
INF 2024-05-29 18:00:58.175+00:00 Slot end                                 >
INF 2024-05-29 18:01:00.000+00:00 Slot start                               >
WRN 2024-05-29 18:01:08.736+00:00 Peer count low, no new peers discovered  >
INF 2024-05-29 18:01:10.176+00:00 Slot end                                 >
INF 2024-05-29 18:01:12.001+00:00 Slot start                               >
WRN 2024-05-29 18:01:21.740+00:00 Peer count low, no new peers discovered  >

From this I assume that Nimbus is still performing some kind of discovery, but it probably only finds peers that it already knew about and that it marked as bad during the period of lost connectivity. Two hours later, the node is still stuck:

WRN 2024-05-29 18:14:47.993+00:00 Peer count low, no new peers discovered  >
WRN 2024-05-29 19:36:57.577+00:00 Peer count low, no new peers discovered  >

I restart the node and things go back to normal quickly:

INF 2024-05-29 19:52:53.216+00:00 Launching beacon node                    >
NTC 2024-05-29 19:53:16.351+00:00 REST service started                       address=0.0.0.0:4040
NTC 2024-05-29 19:53:16.351+00:00 Starting light client                      topics="lightcl" trusted_block_root=none(Hash256)
INF 2024-05-29 19:53:16.352+00:00 Scheduling first slot action               topics="beacnde" startTime=34w6d7h53m16s351ms972us788ns nextSlot=1759167 timeToNextSlot=7s648ms27us212ns
INF 2024-05-29 19:53:23.804+00:00 Eth1 sync progress                         topics="elman" blockNumber=0x18e2e9 depositsProcessed=277592
INF 2024-05-29 19:53:24.000+00:00 Slot start                                 topics="beacnde" head=7a2eb6dc:1758406 delay=862us408ns finalized=54948:53c472bd peers=0 slot=1759167 sync="--h--m (99>
INF 2024-05-29 19:53:34.296+00:00 Slot end                                   topics="beacnde" slot=1759167 nextActionWait=n/a nextAttestationSlot=-1 nextProposalSlot=-1 syncCommitteeDuties=none h>
INF 2024-05-29 19:53:36.236+00:00 Slot start                                 topics="beacnde" head=6ff64b30:1758425 delay=236ms19us319ns finalized=54948:53c472bd peers=9 slot=1759168 sync="--h--m>
INF 2024-05-29 19:53:38.527+00:00 State replayed                             topics="chaindag" blocks=0 slots=2 current=deec0cba:1758431 ancestor=deec0cba:1758431 target=deec0cba:1758431@1758433 >
INF 2024-05-29 19:53:47.190+00:00 Database checkpointed                      topics="beacnde" dur=345ms71us265ns
INF 2024-05-29 19:53:47.190+00:00 Slot end                                   topics="beacnde" slot=1759168 nextActionWait=n/a nextAttestationSlot=-1 nextProposalSlot=-1 syncCommitteeDuties=none h>
INF 2024-05-29 19:53:48.463+00:00 State replayed                             topics="chaindag" blocks=0 slots=1 current=ae350c75:1758463 ancestor=ae350c75:1758463 target=ae350c75:1758463@1758464 >
INF 2024-05-29 19:53:49.345+00:00 Slot start                                 topics="beacnde" head=c18dd646:1758464 delay=1s345ms67us739ns finalized=54950:d59e9479 peers=22 slot=1759169 sync="00h>
INF 2024-05-29 19:53:58.106+00:00 State replayed                             topics="chaindag" blocks=0 slots=2 current=4d5fb847:1758494 ancestor=4d5fb847:1758494 target=4d5fb847:1758494@1758496 >
INF 2024-05-29 19:53:58.997+00:00 Missed multiple heartbeats                 topics="libp2p gossipsub" heartbeat=GossipSub delay=1s925ms309us773ns hinterval=700ms

As you can see above, the node immediately picks up 22 peers at 19:53:49.

Additional context Nimbus 24.5.1 from the official Docker image.

cheatfate commented 4 weeks ago

Could you please provide the execution arguments of the nimbus_beacon_node binary?

catwith1hat commented 4 weeks ago

/home/user/nimbus_beacon_node --network=holesky --jwt-secret=/jwt.hex \
  --udp-port=${PORT} --tcp-port=${PORT} --data-dir=/nimbus-data \
  --el=http://localhost:8640 --enr-auto-update --metrics --metrics-port=11140 \
  --metrics-address=0.0.0.0 --rest --rest-port=4040 --rest-address=0.0.0.0 \
  --suggested-fee-recipient=${ADDR} --doppelganger-detection=off \
  --history=prune --web3-signer-update-interval=300 \
  --in-process-validators=false --payload-builder=true \
  --payload-builder-url=http://localhost:18920

cheatfate commented 3 weeks ago

Could you please confirm that you are using the latest release version of nimbus-eth2 (24.5.1) and that you have not specified the --listen-address CLI option?

kdeme commented 3 weeks ago

@catwith1hat Do you also have the full log line for:

WRN 2024-05-29 18:00:03.718+00:00 Peer count low, no new peers discovered  >

?

Normally this log line should also print the discovered_nodes, new_peers, current_peers and wanted_peers.

catwith1hat commented 3 weeks ago

@cheatfate I can confirm that I haven't set the --listen-address CLI option:

$ ps axu | grep beacon | grep listen-add | wc
      0       0       0

@kdeme: Sorry, I truncated the line while copying it. Here is the full line:

WRN 2024-05-29 18:00:03.718+00:00 Peer count low, no new peers discovered    topics="networking" discovered_nodes=0 new_peers=@[] current_peers=0 wanted_peers=160

kdeme commented 3 weeks ago

Sorry, I truncated the line while copying it. Here is the full line:

Thank you, that's very useful.

In terms of reproducing this: Do I understand correctly that you are running a Nimbus Docker container inside a QEMU VM?

edit: As I cannot reproduce this myself, it would be really good to know the exact setup, as I think this will be some setup-specific issue. I also think that the Discovery send failed msg="(101) Network is unreachable" log is not something that would show up when run from a Docker container, at least not in network-bridge mode.

Additional question, did all the Peer count low, no new peers discovered lines give discovered_nodes set to 0?

catwith1hat commented 3 weeks ago

In terms of reproducing this: Do I understand correctly that you are running a Nimbus Docker container inside a QEMU VM?

That's correct.

Additional question, did all the Peer count low, no new peers discovered lines give discovered_nodes set to 0?

Pretty much:

$ journalctl -u podman-nimbus-N4-I0.service --since="2024-05-29 16:00:00" --until="2024-05-29 20:00:00" | \
  grep -oE "Peer count low.*" | \
  awk '
# Line differs from the previous one: flush the finished run, start a new one.
$0 != prev {
    if (count > 1) {
        print count "x" prev
    }
    prev = $0
    count = 1
}
# Same line as before: extend the current run.
$0 == prev {
    count++
}
# Flush the last run at end of input.
END {
    if (count > 1) {
        print count "x" prev
    }
}'
2xPeer count low, no new peers discovered    topics="networking" discovered_nodes=0 new_peers=@[] current_peers=36 wanted_peers=160
2xPeer count low, no new peers discovered    topics="networking" discovered_nodes=0 new_peers=@[] current_peers=35 wanted_peers=160
2xPeer count low, no new peers discovered    topics="networking" discovered_nodes=1 new_peers=@[] current_peers=0 wanted_peers=160
5xPeer count low, no new peers discovered    topics="networking" discovered_nodes=0 new_peers=@[] current_peers=0 wanted_peers=160
4xPeer count low, no new peers discovered    topics="networking" discovered_nodes=1 new_peers=@[] current_peers=0 wanted_peers=160
3xPeer count low, no new peers discovered    topics="networking" discovered_nodes=0 new_peers=@[] current_peers=0 wanted_peers=160
4xPeer count low, no new peers discovered    topics="networking" discovered_nodes=1 new_peers=@[] current_peers=0 wanted_peers=160
5xPeer count low, no new peers discovered    topics="networking" discovered_nodes=0 new_peers=@[] current_peers=0 wanted_peers=160
2xPeer count low, no new peers discovered    topics="networking" discovered_nodes=1 new_peers=@[] current_peers=0 wanted_peers=160
29xPeer count low, no new peers discovered    topics="networking" discovered_nodes=0 new_peers=@[] current_peers=0 wanted_peers=160

(the awk script collapses consecutive identical lines into a repeat count; lines that occur only once are dropped)
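
For what it's worth, the same collapsing can be done with uniq -c plus a filter on the count (equivalent output, modulo formatting):

$ journalctl -u podman-nimbus-N4-I0.service --since="2024-05-29 16:00:00" --until="2024-05-29 20:00:00" | \
  grep -oE "Peer count low.*" | uniq -c | awk '$1 > 1'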

As I cannot reproduce this myself, it would be really good to know the exact setup, as I think this will be some setup-specific issue. I also think that the Discovery send failed msg="(101) Network is unreachable" log is not something that would show up when run from a Docker container, at least not in network-bridge mode.

It is probably correct that you would not get Network is unreachable in a standard Docker setup; my networking inside the Docker container is indeed non-standard. I use my personal equivalent of gluetun, which sets up VPN networking inside a container, and the Nimbus container attaches to that container's network. When I cut the link, openvpn dies and the default route for Nimbus disappears. However, the networking later comes back, along with a new default route. So I do believe that getting stuck is still undesirable behavior on Nimbus's part.

But spotting Network is unreachable might be a great catch. Hypothesis: maybe Nimbus reacts differently to a socket error that returns "Network is unreachable" than to a connection that simply times out? Perhaps it treats the former as a more permanent error when connecting to a peer, such that the peer is never tried again.
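
One quick way to observe that error class, should it help test the hypothesis (a sketch using a throwaway network namespace; needs root and iproute2):

$ sudo ip netns add noroute
$ sudo ip netns exec noroute ping -c1 1.2.3.4   # the namespace has no routes at all
ping: connect: Network is unreachable           # fails synchronously, no timeout
$ sudo ip netns del noroute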

@kdeme If you still have your setup at hand, would you mind trying to reproduce this by removing the default route of the Docker container (or the whole VM/host) for, say, 60 minutes?
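
A minimal sketch of that step, assuming a single default route (saved first so it can be restored verbatim):

$ DEFROUTE=$(ip route show default)   # e.g. "default via 192.168.122.1 dev eth0"
$ sudo ip route del default
$ sleep 3600                          # ~60 minutes without a default route
$ sudo ip route add $DEFROUTE         # unquoted on purpose so it re-splits into arguments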

catwith1hat commented 2 weeks ago

Two more data points: