Open wjmelements opened 8 months ago
Try killing Lighthouse with kill -9 or similar, then restart it.
I think this is a bug in libp2p where the listener gets interrupted by a shutdown signal during startup and then goes into an infinite loop. See:
Related:
If Lighthouse doesn't restart after the kill, there might be something else that triggered the initial shutdown, e.g. low disk space.
then restart it
The issue happens again on restart. I am using systemd to manage the service, which sends SIGKILL if SIGINT doesn't stop it after a timeout.
low disk space
The disk has 3TB of extra space.
I'm going to try to get the logs from when it starts happening next time, but it is challenging because the log flood eventually rotates those logs away.
If you could DM some debug logs from $datadir/beacon/logs, then I'll have a look at why it might be shutting down.
I can't glean much from the debug logs; they just show Lighthouse eventually shutting down due to the SIGTERM and then the issue being resolved after the restart:
Mar 18 20:28:26.757 INFO Shutting down.., reason: Success("Received SIGTERM"), module: lighthouse:696
Mar 18 20:28:42.133 INFO Lighthouse started, version: Lighthouse/v5.1.1-2a3c709, module: lighthouse:621
The debug logs filled up with Listener error warnings pretty quickly, so the 5 files only covered logs starting Mar 18 20:27:20.336. So there's only ~1 minute of history covered by the logs.
@AgeManning It looks like we expect some listener errors:
I guess we can't change that event handler to shut down immediately, because some listener errors are non-fatal? E.g. listening on 0.0.0.0 causes LH to try listening on some random interface that doesn't work; if we have other interfaces we can safely ignore this sort of failure.
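For illustration, here is a minimal, self-contained sketch of that tolerant-handling pattern. The enum and handler are stand-ins, not Lighthouse's actual code; the real handler matches libp2p's SwarmEvent::ListenerError in the network service loop.

```rust
use std::io;

// Stand-in for the swarm events the real service loop receives.
enum SwarmEvent {
    NewListenAddr(String),
    ListenerError { error: io::Error },
}

fn handle_event(event: SwarmEvent) {
    match event {
        SwarmEvent::NewListenAddr(addr) => println!("listening on {addr}"),
        // A single listener failing (e.g. no IPv6 support) is treated as
        // non-fatal, because another listener (e.g. IPv4 TCP) may be fine.
        SwarmEvent::ListenerError { error } => {
            eprintln!("WARN Listener error: {error}");
        }
    }
}

fn main() {
    handle_event(SwarmEvent::NewListenAddr("/ip4/0.0.0.0/tcp/9000".into()));
    handle_event(SwarmEvent::ListenerError {
        error: io::Error::new(io::ErrorKind::BrokenPipe, "rtnetlink socket closed"),
    });
}
```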
It happened again, but I don't have logs from when it started, because it produces 1 million lines of error logs every 2 minutes, so my logs cycle out before I even notice it went down.
@wjmelements Do you want to try modifying Lighthouse so it just crashes on the lines I linked above? If you replace the current warning with panic!(), it will at least kill things quickly without spamming the logs, and we might be able to see what's going on a bit better.
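Concretely, the temporary change would look something like the sketch below (illustrative only, not a diff against the real source): panic on the first listener error instead of warning, so the node dies immediately and the logs around the first failure survive.

```rust
use std::io;

// Debug-only variant of the handler sketched above: crash on the first
// listener error (run with RUST_BACKTRACE=1 to capture a backtrace).
fn handle_listener_error(error: io::Error) {
    panic!("Listener error: {error:?}");
}

fn main() {
    handle_listener_error(io::Error::new(
        io::ErrorKind::BrokenPipe,
        "rtnetlink socket closed",
    ));
}
```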
Yeah. So the reason we tolerate the listen errors is that libp2p will generally try to listen on a number of interfaces, i.e. ip4 tcp, ip6 tcp, ip4 udp (quic), and ip6 udp (quic). The host might not have ip6 support, or a tcp port might fail to bind. Each individual listen error should not be a catastrophic failure, as we can fail on ip6 but succeed on ip4 (a common case).
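As a rough sketch of that (the addresses and port here are assumptions, not Lighthouse's actual defaults), the node asks libp2p to listen on several candidate addresses and treats a failure on any one of them as skippable:

```rust
// Candidate listen addresses, roughly mirroring the ip4/ip6 tcp/quic set
// described above. In the real client each of these would be parsed into a
// libp2p Multiaddr and passed to swarm.listen_on(); an Err from one address
// is logged and skipped rather than treated as fatal.
fn candidate_listen_addrs(port: u16) -> Vec<String> {
    vec![
        format!("/ip4/0.0.0.0/tcp/{port}"),         // IPv4 TCP
        format!("/ip6/::/tcp/{port}"),              // IPv6 TCP
        format!("/ip4/0.0.0.0/udp/{port}/quic-v1"), // IPv4 QUIC
        format!("/ip6/::/udp/{port}/quic-v1"),      // IPv6 QUIC
    ]
}

fn main() {
    for addr in candidate_listen_addrs(9000) {
        println!("would listen on {addr}");
    }
}
```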
I'll reach out to Michael and try and have a look at the logs. As of 5.1 we now store libp2p logs in the .lighthouse/
Also including @jxs who may be interested in this
@wjmelements It might be helpful to check your system logs (e.g. /var/log/messages) for any errors related to network interfaces.
The rtnetlink socket closed error is emitted by if-watch (linux.rs), a component that monitors network interface events, such as interfaces coming up or going down. rtnetlink serves as a mechanism for retrieving network information from the kernel, facilitated through a socket communication channel between the kernel and user programs. This error could indicate that an issue has arisen within that socket communication to the kernel.
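For context, here is a minimal sketch of how if-watch is typically driven (assuming the if-watch 3.x tokio API; this is not Lighthouse's code). It opens an rtnetlink socket to the kernel and streams interface events; when that socket breaks, the stream yields the BrokenPipe "rtnetlink socket closed" error seen in the logs above.

```rust
use futures::StreamExt;
use if_watch::tokio::IfWatcher;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Subscribes to rtnetlink interface events via a netlink socket.
    let mut watcher = IfWatcher::new()?;
    while let Some(event) = watcher.next().await {
        match event {
            // IfEvent::Up / IfEvent::Down for addresses appearing/disappearing.
            Ok(ev) => println!("interface event: {ev:?}"),
            // If the netlink socket closes, this is the error that libp2p's
            // listener keeps surfacing as "Listener error" above.
            Err(e) => eprintln!("rtnetlink error: {e}"),
        }
    }
    Ok(())
}
```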
I have logs from where it starts, on holesky:
Apr 08 18:17:06.001 INFO Synced slot: 1391485, block: … empty, epoch: 43483, finalized_epoch: 43481, finalized_root: 0x2b42…9fd0, exec_hash: 0x89e3…f847 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:14.353 INFO New block received root: 0x09d30077b578042c8ce3a2f1a39e5a191544dcf25f1aa45dfd719f2b1aab0236, slot: 1391486
Apr 08 18:17:18.001 INFO Synced slot: 1391486, block: 0x09d3…0236, epoch: 43483, finalized_epoch: 43481, finalized_root: 0x2b42…9fd0, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:30.004 INFO Synced slot: 1391487, block: … empty, epoch: 43483, finalized_epoch: 43481, finalized_root: 0x2b42…9fd0, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:42.316 INFO Synced slot: 1391488, block: … empty, epoch: 43484, finalized_epoch: 43482, finalized_root: 0x5c57…7c7d, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:53.266 WARN Snapshot cache miss for blob verification, index: 4, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:17:53.348 WARN Snapshot cache miss for blob verification, index: 5, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:17:53.733 WARN Snapshot cache miss for blob verification, index: 1, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:17:53.823 WARN Snapshot cache miss for blob verification, index: 2, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:17:54.000 INFO Synced slot: 1391489, block: … empty, epoch: 43484, finalized_epoch: 43482, finalized_root: 0x5c57…7c7d, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:55.251 WARN Snapshot cache miss for blob verification, index: 0, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:18:51.433 INFO New block received root: 0x08b5a63f9c883498a8fc00628bb5bd0390d2715cbe068ffeea9c26fdabf365b1, slot: 1391489
Apr 08 18:19:26.923 INFO Synced slot: 1391497, block: … empty, epoch: 43484, finalized_epoch: 43482, finalized_root: 0x5c57…7c7d, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:19:26.928 WARN Execution engine call failed error: HttpClient(url: http://localhost:6551/, kind: timeout, detail: operation timed out), service: exec
Apr 08 18:19:26.932 WARN Error whilst processing payload status error: Api { error: HttpClient(url: http://localhost:6551/, kind: timeout, detail: operation timed out) }, service: exec
Apr 08 18:19:27.820 WARN Listener error error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Left(Other(Left(Left(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })))) }) })) }, service: libp2p
Apr 08 18:19:27.825 WARN Listener error error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Right(Io(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })) }) })) }, service: libp2p
after which the same log repeats, mostly Right with the occasional Left. The Left here is shown between two Rights:
Apr 08 18:19:27.872 WARN Listener error error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Right(Io(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })) }) })) }, service: libp2p
Apr 08 18:19:27.924 WARN Listener error error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Left(Other(Left(Left(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })))) }) })) }, service: libp2p
Apr 08 18:19:27.934 WARN Listener error error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Right(Io(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })) }) })) }, service: libp2p
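As an aside, the Left/Right nesting in these errors looks like the Debug output of an Either-style combinator, which libp2p uses when it stacks transports (e.g. TCP vs. QUIC), with the innermost layer carrying the actual io error. A tiny illustration using the either crate (an assumption about the error's shape, not taken from Lighthouse):

```rust
use std::io;

// A stand-in for a combined transport's error type: Left for one half
// (e.g. TCP), Right for the other (e.g. QUIC).
type CombinedError = either::Either<io::Error, io::Error>;

fn main() {
    let from_quic_side: CombinedError = either::Either::Right(io::Error::new(
        io::ErrorKind::BrokenPipe,
        "rtnetlink socket closed",
    ));
    // Prints: Right(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" }),
    // matching the innermost layers of the log lines above.
    println!("{from_quic_side:?}");
}
```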
Description
Lighthouse is not connecting to geth (v1.13.12) and floods the logs with error messages about the rtnetlink socket being closed
Version
stable v5.1.1
Present Behaviour
The logs contain only this error, repeated. It does not recover; Lighthouse only spams this log repeatedly.
Geth config:
Lighthouse config:
Expected Behaviour
Healthy node
Steps to resolve
Unknown