sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0

Event monitor connections dropped by Lighthouse, mysteriously #3878

Open broadbentg opened 1 year ago

broadbentg commented 1 year ago

Description

Events are monitored with this API: https://ethereum.github.io/beacon-APIs/?urls.primaryName=v1#/Events

A connection is established and events are monitored. However, after some time ranging from 48 minutes to 26 hours, Lighthouse terminates the connection. For example:

curl -X 'GET' 'http://localhost:5052/eth/v1/events?topics=block,head,attestation,voluntary_exit,finalized_checkpoint,chain_reorg' -H 'accept: text/event-stream' > /dev/null

failed after a little more than two hours. A packet capture with tcpdump shows that Lighthouse terminates the connection:

02:08:51.967504 IP 127.0.0.1.5052 > 127.0.0.1.33682: Flags [P.], seq 1259690769:1259692677, ack 1, win 512, options [nop,nop,TS val 2577486793 ecr 2577486792], length 1908
02:08:51.967530 IP 127.0.0.1.33682 > 127.0.0.1.5052: Flags [.], ack 1259692677, win 500, options [nop,nop,TS val 2577486793 ecr 2577486793], length 0
02:08:51.983931 IP 127.0.0.1.5052 > 127.0.0.1.33682: Flags [F.], seq 1259692677, ack 1, win 512, options [nop,nop,TS val 2577486810 ecr 2577486793], length 0
02:08:51.987471 IP 127.0.0.1.33682 > 127.0.0.1.5052: Flags [F.], seq 1, ack 1259692678, win 512, options [nop,nop,TS val 2577486813 ecr 2577486810], length 0
02:08:51.987498 IP 127.0.0.1.5052 > 127.0.0.1.33682: Flags [.], ack 2, win 512, options [nop,nop,TS val 2577486813 ecr 2577486813], length 0
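For anyone repeating this check on a longer capture, here is a minimal Python sketch that finds which side sends the first FIN. It assumes tcpdump's default text output for IPv4 TCP packets, as in the capture above. The helper name first_fin_sender is made up for illustration and is not part of any tool.

```python
import re

# Matches tcpdump's default text output for IPv4 TCP packets, e.g.
# "02:08:51.983931 IP 127.0.0.1.5052 > 127.0.0.1.33682: Flags [F.], seq ..."
PACKET_RE = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+) IP "
    r"(?P<src>\S+) > (?P<dst>[^:\s]+): Flags \[(?P<flags>[^\]]+)\]"
)


def first_fin_sender(capture_text):
    """Return (timestamp, source address.port) of the first FIN packet, or None."""
    for match in PACKET_RE.finditer(capture_text):
        # tcpdump prints "F" inside the flags field when the FIN bit is set.
        if "F" in match.group("flags"):
            return match.group("time"), match.group("src")
    return None
```

Run against the capture above, the first FIN comes from 127.0.0.1.5052, the Lighthouse HTTP API port, which is what the report concludes.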

This has been seen repeatedly (including in packet captures) with curl, some C test code, and Python. In each case Gnosis Chain was used. Several times, three test processes were run simultaneously. They are not cut off at the same time; the drops appear to happen randomly. No related log messages were found.

Version

The problem has been seen with two pre-built binary x86 versions:

Lighthouse v3.3.0-bf533c8
BLS library: blst-portable
SHA256 hardware acceleration: false
Specs: mainnet (true), minimal (false), gnosis (true)

Lighthouse v3.4.0-38514c0
BLS library: blst-modern
SHA256 hardware acceleration: false
Specs: mainnet (true), minimal (false), gnosis (true)

The problem is seen under Ubuntu on x86_64 virtual hardware:

Distributor ID: Ubuntu
Description: Ubuntu 22.10
Release: 22.10
Codename: kinetic

Present Behaviour

Event monitor TCP connections do not stay up indefinitely.

Expected Behaviour

Event monitor TCP connections should stay up until the client terminates them.

Steps to resolve

A simple workaround is to re-open the TCP connection and continue monitoring events. However, events emitted between the drop and the reconnect could be missed.
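A rough sketch of that reconnect loop in Python, using only the standard library. It is not the test code mentioned above: the function names, the reduced topic list, and the one-second retry delay are assumptions made for illustration. It does not recover events emitted while the connection was down.

```python
import http.client
import time
import urllib.request

# Same endpoint as the curl example; the topic list is shortened for illustration.
URL = ("http://localhost:5052/eth/v1/events"
       "?topics=block,head,finalized_checkpoint,chain_reorg")


def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of decoded SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line terminates one event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())


def open_stream(url=URL):
    """Open the event stream and yield decoded lines until the server closes it."""
    req = urllib.request.Request(url, headers={"Accept": "text/event-stream"})
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            yield raw.decode("utf-8")


def monitor(handle, open_stream=open_stream, max_reconnects=None, delay=1.0):
    """Call handle(event, data) for each event, reconnecting when the stream ends."""
    reconnects = 0
    while True:
        try:
            for event, data in parse_sse(open_stream()):
                handle(event, data)
        except (OSError, http.client.HTTPException) as exc:
            print(f"stream error: {exc}")
        # Reaching here means the server closed the stream or an error occurred.
        if max_reconnects is not None and reconnects >= max_reconnects:
            return
        reconnects += 1
        time.sleep(delay)
```

Calling monitor(print) would print each event and reconnect indefinitely. Passing max_reconnects bounds the loop, which is mainly useful for testing.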

broadbentg commented 1 year ago

The rate of connection drops seems to be much higher when backfilling states using --reconstruct-historic-states. Perhaps because of increased net traffic?

michaelsproul commented 1 year ago

The rate of connection drops seems to be much higher when backfilling states using --reconstruct-historic-states

Reconstructing historic states doesn't require any network traffic, but it will impose some disk and CPU load.