jajaislanina opened this issue 4 months ago
Can you share your beacon node logs? You can send them on our Discord; I'm @pawan on the sigp Discord.
Logs shared via Discord.
Experienced another similar issue on another node - this time managed to capture all logs. Sent via PM to @pawanjay176 on Discord.
What was the resolution on this issue? I have the same thing happening consistently with a Lighthouse + Reth combo running in k8s. Everything works initially, with both the consensus and execution clients getting peers without issues, but then Lighthouse starts losing peers over time. My guess is that it simply never gains new peers while old ones disconnect naturally over time.
Restarting the Lighthouse container alone does not seem to help, and restarting Reth alone does not fix it either, but restarting both seems to resolve the issue.
They have different discovery ports configured. My Lighthouse is configured like so:
```shell
lighthouse bn --http --http-address=0.0.0.0 --execution-endpoint=http://localhost:8551 \
  --logfile-debug-level debug --port 9000 --enable-private-discovery --metrics \
  --metrics-address=0.0.0.0 --execution-jwt=/config/jwt-secret.txt --disable-deposit-contract-sync \
  --checkpoint-sync-url=https://checkpoint-sync.sepolia.ethpandaops.io/ --disable-backfill-rate-limiting \
  --network=sepolia --datadir=/data --network-dir=/tmp --disable-upnp --execution-timeout-multiplier=1 \
  --disable-lock-timeouts
```
I have confirmed with netcat that ports 9000 and 9001 are listening and accepting external connections.
Never found a solution to this. I am running Lighthouse + Geth in the same pod and have added a liveness probe that kills both containers if the peer count on Lighthouse stays below 4 for longer than 60 minutes. What we did find was that dials to peers were failing (timing out) with no apparent root cause.
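For reference, a probe along those lines can be sketched as below. This is a minimal illustration, not our exact probe: the threshold, the endpoint host/port (Lighthouse's default beacon API on `localhost:5052`), and the `sed`-based JSON parsing are all assumptions; the `/eth/v1/node/peer_count` endpoint itself is part of the standard beacon-node API that Lighthouse implements.

```shell
#!/bin/sh
# Sketch of a k8s liveness probe: fail when Lighthouse's connected peer
# count drops below a floor. Threshold and parsing are illustrative.
MIN_PEERS=4

peer_count() {
  # $1: JSON body from /eth/v1/node/peer_count,
  # e.g. {"data":{"connected":"56","connecting":"0",...}}
  echo "$1" | sed -n 's/.*"connected":"\([0-9]*\)".*/\1/p'
}

check() {
  count=$(peer_count "$1")
  if [ "${count:-0}" -lt "$MIN_PEERS" ]; then
    echo "unhealthy: ${count:-0} peers"
    return 1
  fi
  echo "healthy: $count peers"
  return 0
}

# In the real probe the body would come from the beacon API, e.g.:
#   check "$(curl -s http://localhost:5052/eth/v1/node/peer_count)"
check '{"data":{"connected":"56","connecting":"0","disconnected":"12","disconnecting":"0"}}'
# → prints "healthy: 56 peers"
```

Wiring this script into an `exec` liveness probe with a long `failureThreshold` approximates the "below 4 peers for 60 minutes" behaviour, since k8s only restarts the container after the probe fails repeatedly.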
@jajaislanina That does sound strange. Please let us know if it continues in 5.2, as we've fixed a few sync & lookup bugs. The dialing issue sounds unrelated to those fixes, though.
Will update in a few days. Currently upgrading the Holesky nodes to 5.2.0 for the memory footprint (right now we see weird spikes of over 40GB of memory and 15 vCPU cores when the node is lagging). Hopefully this also helps with peer retention.
Hi @michaelsproul
Just had one of the Sepolia nodes running Lighthouse 5.2.0 experience sync issues. When we checked, the peer count was 5 and had been declining over the last few days. Note that one node is fine while the other one (light blue in the graph) keeps losing peers.
Description
Over the course of 2 days the Lighthouse peer count drops from ~70 to 0, which stops the sync.
Version
Docker image version 4.6.0
Present Behaviour
This happens a few times per week. I am running multiple Ethereum Mainnet, Sepolia, and Holesky nodes, and it happens mostly on Mainnet and Sepolia. When the peer count drops to single digits, the node stops syncing. Restarting the node fixes the issue. Nothing in the logs stands out as relevant to this problem.
I am assuming that my nodes are somehow flagged as "bad" and over time get blacklisted by other nodes in the network, but I have no proof or means to confirm this.
Expected Behaviour
I would expect the peer count to remain stable over time and Lighthouse to reconnect to peers, i.e. not allow the count to drop to 0.
Steps to resolve
Restart is a temporary mitigation.

![image](https://github.com/sigp/lighthouse/assets/3975615/459f85aa-6348-4a71-9ce9-f3230f176473)