sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0
2.94k stars 746 forks

error: "rtnetlink socket closed" #5428

Open wjmelements opened 8 months ago

wjmelements commented 8 months ago

Description

Lighthouse is not connecting to geth (v1.13.12) and floods the logs with error messages about the rtnetlink socket being closed.

Version

stable v5.1.1

Present Behaviour

The log contains only this error, repeated:

WARN Listener error                          error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Right(Io(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })) }) })) }, service: libp2p

It does not recover. Lighthouse only spams this log repeatedly.

Geth config:

--authrpc.addr localhost --authrpc.port 8551 --authrpc.jwtsecret /shared_ebs/geth_data/geth/jwtsecret

Lighthouse config:

--execution-endpoint http://localhost:8551 --execution-jwt /shared_ebs/geth_data/geth/jwtsecret

Expected Behaviour

Healthy node

Steps to resolve

Unknown

michaelsproul commented 8 months ago

Try killing Lighthouse with kill -9 or similar, then restart it.

I think this is a bug in libp2p where the listener gets interrupted by a shutdown signal during startup and then goes into an infinite loop. See:

Related:

If Lighthouse doesn't restart after the kill there might be something else that triggered the initial shutdown, e.g. low disk space.

wjmelements commented 8 months ago

then restart it

The issue happens again on restart. I am using systemd to manage the service, which sends SIGKILL if SIGINT doesn't stop it after long enough.

low disk space

The disk has 3TB of extra space.

I'm going to try to get the logs from when it starts happening next time, but it is challenging because the log flood eventually cycles those logs out.

michaelsproul commented 8 months ago

If you could DM some debug logs from $datadir/beacon/logs then I'll have a look at why it might be shutting down

michaelsproul commented 8 months ago

I can't glean much from the debug logs; they just show Lighthouse eventually shutting down due to the SIGTERM and then the issue resolving after the restart:

Mar 18 20:28:26.757 INFO Shutting down.., reason: Success("Received SIGTERM"), module: lighthouse:696

Mar 18 20:28:42.133 INFO Lighthouse started, version: Lighthouse/v5.1.1-2a3c709, module: lighthouse:621

The debug logs filled up with warnings pretty quickly, so the 5 files only covered:

So there's only ~1 minute of history covered by the logs.

@AgeManning It looks like we expect some listener errors:

https://github.com/sigp/lighthouse/blob/5ce16192c769e4b593d7f953edcfb6e548c84167/beacon_node/lighthouse_network/src/service/mod.rs#L1793-L1801

I guess we can't change that event handler to shutdown immediately because some listener errors are non-fatal? e.g. listening on 0.0.0.0 causes LH to try listening on some random interface that doesn't work, if we have other interfaces we can safely ignore this sort of failure.
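One option would be to distinguish the repeating rtnetlink failure from ordinary per-interface bind failures inside that handler. A minimal sketch of such a classifier, with hypothetical names (this is not Lighthouse's actual API, and matching on the error message string is a stopgap until the error type carries more detail):

```rust
// Hypothetical helper: decide whether a listener error is fatal.
// A single interface failing to bind (e.g. no ip6 support when
// listening on 0.0.0.0/::) is expected and safe to ignore, but a
// closed rtnetlink socket means the interface watcher is gone and
// the same error will repeat forever.
fn is_fatal_listener_error(error_msg: &str) -> bool {
    error_msg.contains("rtnetlink socket closed")
}

fn main() {
    // The repeating error from the logs above: treat as fatal.
    assert!(is_fatal_listener_error(
        "Io(Custom { kind: BrokenPipe, error: \"rtnetlink socket closed\" })"
    ));
    // A one-off bind failure on an unsupported interface: non-fatal.
    assert!(!is_fatal_listener_error("Address family not supported by protocol"));
    println!("classifier ok");
}
```

Whether the fatal case should shut the node down or attempt to rebuild the interface watcher is a separate design question.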

wjmelements commented 7 months ago

It happened again, but I don't have logs from when it started because it produces a million lines of error logs every 2 minutes, so my logs cycle out before I even notice it went down.

michaelsproul commented 7 months ago

@wjmelements Do you want to try modifying Lighthouse so it just crashes on the lines I linked above? If you replace the current warning with panic!() it will at least kill things quickly without spamming the logs, and we might be able to see what's going on a bit better.

AgeManning commented 7 months ago

Yeah. So the reason we tolerate the listen errors is that libp2p will generally try to listen on a number of interfaces, i.e.: ip4 tcp, ip6 tcp, ip4 udp (quic), ip6 udp (quic).

The host might not have ip6 support, or a tcp port might fail to bind. Each individual listen error should not be a catastrophic failure, as we can fail on ip6 but succeed on ip4 (a common case).
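The four listen attempts described above can be sketched as multiaddr strings (the port and exact addresses are illustrative, not Lighthouse's actual defaults beyond the conventional 9000):

```rust
fn main() {
    // Illustrative listen addresses: one per transport/IP-version pair.
    // libp2p tolerates any single one failing to bind (e.g. no ip6
    // support) as long as at least one listener succeeds.
    let listen_addrs = [
        "/ip4/0.0.0.0/tcp/9000",         // ip4 tcp
        "/ip6/::/tcp/9000",              // ip6 tcp
        "/ip4/0.0.0.0/udp/9000/quic-v1", // ip4 udp (quic)
        "/ip6/::/udp/9000/quic-v1",      // ip6 udp (quic)
    ];
    assert_eq!(listen_addrs.len(), 4);
    for addr in listen_addrs {
        println!("{addr}");
    }
}
```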

I'll reach out to Michael and try to have a look at the logs. As of 5.1 we now store libp2p logs in the .lighthouse//beacon_node/logs directory. There might be useful information there.

AgeManning commented 7 months ago

Also including @jxs who may be interested in this

ackintosh commented 7 months ago

@wjmelements It might be helpful to check your system logs (e.g. /var/log/messages) for any errors related to network interfaces.

The rtnetlink socket closed error is emitted by if-watch (linux.rs), a component that watches network interface events, such as interfaces coming up or going down. rtnetlink serves as a mechanism for retrieving network information from the kernel, via a socket communication channel between the kernel and user programs. This error could indicate that an issue has arisen within the socket communication with the kernel.
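For reference, the innermost error in the Lighthouse log is a plain std::io::Error with kind BrokenPipe wrapping a static message. A minimal sketch of that shape (not if-watch's actual code):

```rust
use std::io::{Error, ErrorKind};

fn main() {
    // Sketch: the error if-watch surfaces when the netlink socket to
    // the kernel closes unexpectedly is an io::Error of this form.
    let err = Error::new(ErrorKind::BrokenPipe, "rtnetlink socket closed");

    // This is the `Custom { kind: BrokenPipe, error: "rtnetlink socket
    // closed" }` fragment visible at the bottom of the nested error in
    // the Lighthouse logs.
    assert_eq!(err.kind(), ErrorKind::BrokenPipe);
    println!("{err}"); // prints "rtnetlink socket closed"
}
```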

wjmelements commented 7 months ago

I have logs from where it starts, on holesky:

Apr 08 18:17:06.001 INFO Synced                                  slot: 1391485, block:    …  empty, epoch: 43483, finalized_epoch: 43481, finalized_root: 0x2b42…9fd0, exec_hash: 0x89e3…f847 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:14.353 INFO New block received                      root: 0x09d30077b578042c8ce3a2f1a39e5a191544dcf25f1aa45dfd719f2b1aab0236, slot: 1391486
Apr 08 18:17:18.001 INFO Synced                                  slot: 1391486, block: 0x09d3…0236, epoch: 43483, finalized_epoch: 43481, finalized_root: 0x2b42…9fd0, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:30.004 INFO Synced                                  slot: 1391487, block:    …  empty, epoch: 43483, finalized_epoch: 43481, finalized_root: 0x2b42…9fd0, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:42.316 INFO Synced                                  slot: 1391488, block:    …  empty, epoch: 43484, finalized_epoch: 43482, finalized_root: 0x5c57…7c7d, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:53.266 WARN Snapshot cache miss for blob verification, index: 4, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:17:53.348 WARN Snapshot cache miss for blob verification, index: 5, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:17:53.733 WARN Snapshot cache miss for blob verification, index: 1, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:17:53.823 WARN Snapshot cache miss for blob verification, index: 2, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:17:54.000 INFO Synced                                  slot: 1391489, block:    …  empty, epoch: 43484, finalized_epoch: 43482, finalized_root: 0x5c57…7c7d, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:55.251 WARN Snapshot cache miss for blob verification, index: 0, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:18:51.433 INFO New block received                      root: 0x08b5a63f9c883498a8fc00628bb5bd0390d2715cbe068ffeea9c26fdabf365b1, slot: 1391489
Apr 08 18:19:26.923 INFO Synced                                  slot: 1391497, block:    …  empty, epoch: 43484, finalized_epoch: 43482, finalized_root: 0x5c57…7c7d, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:19:26.928 WARN Execution engine call failed            error: HttpClient(url: http://localhost:6551/, kind: timeout, detail: operation timed out), service: exec
Apr 08 18:19:26.932 WARN Error whilst processing payload status  error: Api { error: HttpClient(url: http://localhost:6551/, kind: timeout, detail: operation timed out) }, service: exec
Apr 08 18:19:27.820 WARN Listener error                          error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Left(Other(Left(Left(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })))) }) })) }, service: libp2p
Apr 08 18:19:27.825 WARN Listener error                          error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Right(Io(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })) }) })) }, service: libp2p

after which the same log line repeats, mostly Right with the occasional Left. The Left variant here is shown between two Rights:

Apr 08 18:19:27.872 WARN Listener error                          error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Right(Io(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })) }) })) }, service: libp2p
Apr 08 18:19:27.924 WARN Listener error                          error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Left(Other(Left(Left(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })))) }) })) }, service: libp2p
Apr 08 18:19:27.934 WARN Listener error                          error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Right(Io(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })) }) })) }, service: libp2p
ackintosh commented 7 months ago

FYI: I have submitted a PR to if-watch to improve the rtnetlink socket closed error message by including the specific reason. This should help us understand why the error occurs.

https://github.com/mxinden/if-watch/pull/39