sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0

Peer loss over time #5271

Open jajaislanina opened 4 months ago

jajaislanina commented 4 months ago

Description

Over the period of 2 days Lighthouse peer count goes from ~70 to 0 which stops the sync.

Version

Docker image version 4.6.0

Present Behaviour

This happens a few times per week. I am running multiple Ethereum Mainnet, Sepolia and Holesky nodes, and it happens mostly on Mainnet and Sepolia. When the peer count drops to single digits, the node stops syncing. Restarting the node fixes the issue. Nothing in the logs stands out as relevant to this issue.

I am assuming that somehow my nodes are flagged as "bad" and over time get blacklisted by other nodes in the network - but I have no proof or means to confirm this.

Expected Behaviour

I would expect the peer count to remain stable over time and for Lighthouse to re-connect to peers - basically not allow the count to drop to 0.

Steps to resolve

Restart is a temporary mitigation.

[screenshot attached]

pawanjay176 commented 4 months ago

Can you share your beacon node logs? You can send on our discord, I'm @pawan on the sigp discord

jajaislanina commented 4 months ago

Logs shared via Discord.

jajaislanina commented 4 months ago

Experienced another similar issue on another node - this time managed to capture all logs. Sent via PM to @pawanjay176 on Discord.

demon-xxi commented 3 weeks ago

What was the resolution on this issue? I have the same thing happening consistently with a lighthouse+reth combo running in k8s. Everything works initially, with both consensus and execution clients getting peers without issues. But then lighthouse starts losing peers over time. I guess it just never gets new peers while old ones disconnect naturally over time.

Restarting lighthouse container does not seem to help, restarting reth alone does not fix this either. But restarting both seems to fix the issue.

They have different discovery ports configured. My lighthouse is configured as so:

 lighthouse bn --http --http-address=0.0.0.0 \
      --execution-endpoint=http://localhost:8551 \
      --logfile-debug-level debug --port 9000 --enable-private-discovery \
      --metrics --metrics-address=0.0.0.0 \
      --execution-jwt=/config/jwt-secret.txt --disable-deposit-contract-sync \
      --checkpoint-sync-url=https://checkpoint-sync.sepolia.ethpandaops.io/ \
      --disable-backfill-rate-limiting --network=sepolia --datadir=/data \
      --network-dir=/tmp --disable-upnp --execution-timeout-multiplier=1 \
      --disable-lock-timeouts

I have confirmed with netcat that ports 9000 and 9001 are listening and accepting external connections
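For reference, the same reachability check can be done from Python instead of netcat. This is a minimal sketch, not part of the original setup; host and port values are whatever you are probing:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection performs the full TCP handshake, so a True result
        # means something is actually listening, not just that the host is up.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check the lighthouse libp2p/discovery ports from the flags above.
# port_open("my-node.example", 9000)
# port_open("my-node.example", 9001)
```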

jajaislanina commented 3 weeks ago

Never found the solution to this. I am running Lighthouse+Geth in the same pod and have added a liveness probe that kills both containers if the peer count on LH stays below 4 for longer than 60 minutes. What we did find was timeout failures when dialing peers, with no apparent root cause.
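The probe described above can be sketched roughly as follows. This is an illustrative sketch, not the poster's actual probe: it assumes the standard Beacon API `/eth/v1/node/peer_count` endpoint (which Lighthouse serves when `--http` is enabled) and a hypothetical beacon URL; the 60-minute grace period would be handled by the k8s probe settings, not this script:

```python
import json
import urllib.request

BEACON_URL = "http://localhost:5052"  # assumed default Lighthouse HTTP port
MIN_PEERS = 4                         # threshold from the comment above

def connected_peers(beacon_url: str = BEACON_URL) -> int:
    """Query the standard Beacon API peer-count endpoint."""
    with urllib.request.urlopen(f"{beacon_url}/eth/v1/node/peer_count") as resp:
        body = json.load(resp)
    # The API encodes counts as decimal strings,
    # e.g. {"data": {"connected": "56", "connecting": "1", ...}}
    return int(body["data"]["connected"])

def is_healthy(peer_count: int, min_peers: int = MIN_PEERS) -> bool:
    """Probe passes while the peer count is at or above the threshold."""
    return peer_count >= min_peers
```

Wired into a k8s `exec` liveness probe (exit 0 when healthy, non-zero otherwise), with `failureThreshold` and `periodSeconds` tuned to give the 60-minute window before the containers are killed.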

michaelsproul commented 2 weeks ago

@jajaislanina That does sound strange. Please let us know if it continues in 5.2, as we've fixed a few sync & lookup bugs. Sounds like the dialing issue is unrelated to those fixes though

jajaislanina commented 2 weeks ago

Will update in a few days. Currently upgrading Holesky nodes to 5.2.0 for the memory footprint (right now we have weird spikes of over 40GB of memory and 15 vCPU cores when the node is lagging). Hopefully this also helps with peer retention.

jajaislanina commented 2 weeks ago

Hi @michaelsproul

Just had one of the Sepolia nodes on Lighthouse version 5.2.0 experience sync issues. When we checked, the peer count was 5 and had been declining over the last few days. Note that one node is fine while the other one (light blue) starts losing peers:

[screenshot: peer count chart attached]