Open jakubgs opened 2 months ago
If we look at stable
node on macm2-01.ih-eu-mda1.nimbus.holesky
we can see that the sync distance is just growing:
The testing
node has a similar pattern:
While unstable
appears to be fine:
The interesting part is that both stable
:
And testing
:
Are unable to get to max peers, while unstable
gets there without issues:
But LibP2P ports for all 3 nodes are correctly open on the firewall:
> nmap -Pn -p9001-9003 macm2-01.ih-eu-mda1.nimbus.holesky.status.im
Starting Nmap 7.94 ( https://nmap.org ) at 2024-09-24 15:13 CEST
Nmap scan report for macm2-01.ih-eu-mda1.nimbus.holesky.status.im (185.181.230.76)
Host is up (0.057s latency).
rDNS record for 185.181.230.76: no-rdns.innovahosting.net
PORT STATE SERVICE
9001/tcp open tor-orport
9002/tcp open dynamid
9003/tcp open unknown
Nmap done: 1 IP address (1 host up) scanned in 0.12 seconds
So what is different about stable
and testing
that makes connectivity so bad?
What's also interesting is that their respective Geth nodes on `` have different matching state too:
jakubgs@holesky-01.ih-eu-mda1.nimbus.geth:~ % for dir in /docker/geth-*; do $dir/rpc.sh eth_syncing | jq -c; done
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":{"currentBlock":"0x0","healedBytecodeBytes":"0x0","healedBytecodes":"0x0","healedTrienodeBytes":"0x0","healedTrienodes":"0x0","healingBytecode":"0x0","healingTrienodes":"0x0","highestBlock":"0x0","startingBlock":"0x0","syncedAccountBytes":"0x0","syncedAccounts":"0x0","syncedBytecodeBytes":"0x0","syncedBytecodes":"0x0","syncedStorage":"0x0","syncedStorageBytes":"0x0","txIndexFinishedBlocks":"0x0","txIndexRemainingBlocks":"0x1"}}
The Geth node for unstable
is not synced, while ones for stable
and testing
appear to be synced:
jakubgs@holesky-01.ih-eu-mda1.nimbus.geth:~ % sudo du -hsc /docker/*
131G /docker/geth-holesky-01
150G /docker/geth-holesky-02
4.4G /docker/geth-holesky-03
284G total
This is very confusing.
I found some EL connection failures in stable
node logs:
{"lvl":"DBG","ts":"2024-09-24 16:30:46.299+03:00","msg":"Sending message to RPC server","topics":"JSONRPC-HTTP-CLIENT","address":{"oResultPrivate":true,"vResultPrivate":{"id":"holesky-01.ih-eu-mda1.nimbus.geth.wg:8551","scheme":"NonSecure","hostname":"holesky-01.ih-eu-mda1.nimbus.geth.wg","port":8551,"path":"","query":"","anchor":"","username":"","password":"","addresses":[{"family":"IPv4","address_v4":[10,15,0,12],"port":8551}]}},"msg_len":342,"name":"engine_forkchoiceUpdatedV1"}
{"lvl":"DBG","ts":"2024-09-24 16:30:46.300+03:00","msg":"Cancelled POST Request with JSON-RPC","topics":"JSONRPC-HTTP-CLIENT","e":"Future operation cancelled!"}
And indeed the port is closed:
jakubgs@macm2-01.ih-eu-mda1.nimbus.holesky:~ % sudo nmap -Pn -p8551-8553 holesky-01.ih-eu-mda1.nimbus.geth.wg
Starting Nmap 7.95 ( https://nmap.org ) at 2024-09-24 16:31 EEST
Nmap scan report for holesky-01.ih-eu-mda1.nimbus.geth.wg (10.15.0.12)
Host is up (0.00085s latency).
PORT STATE SERVICE
8551/tcp closed unknown
8552/tcp open unknown
8553/tcp open unknown
Nmap done: 1 IP address (1 host up) scanned in 0.03 seconds
But the one for testing
node is open, so that doesn't explain low peers at all.
The geth-holesky-01-node
logs on holesky-01.ih-eu-mda1.nimbus.geth
show a lot of these:
WARN [09-24|13:30:34.987] Beacon client online, but no consensus updates received in a while. Please fix your beacon client to follow the chain!
But that would make sense since that BN is not actually synced.
Oh, I see, the port start is wrong, it should start at 8552
:
jakubgs@macm2-01.ih-eu-mda1.nimbus.holesky:~ % sudo nmap -Pn -p8552-8554 holesky-01.ih-eu-mda1.nimbus.geth.wg
Starting Nmap 7.95 ( https://nmap.org ) at 2024-09-24 16:36 EEST
Nmap scan report for holesky-01.ih-eu-mda1.nimbus.geth.wg (10.15.0.12)
Host is up (0.00079s latency).
PORT STATE SERVICE
8552/tcp open unknown
8553/tcp open unknown
8554/tcp open rtsp-alt
Nmap done: 1 IP address (1 host up) scanned in 0.02 seconds
This would also explain why last Geth node is not synced. Since it has no BN connected to it.
Fixed Geth ports:
infra-nimbus#126afac1
- holesky: fix Geth port numbers on macm2-01This drop in peers of stable
node is bizzare:
It was doing okay and then it crashes and burns:
There's not a single error in logs, just some warning:
nimbus@macm2-01.ih-eu-mda1.nimbus.holesky:~/beacon-node-holesky-stable/logs % grep ' 07:3' service.20240925_0747.log | grep WRN
{"lvl":"WRN","ts":"2024-09-25 07:31:20.367+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":3,"new_peers":[],"current_peers":80,"wanted_peers":320}
{"lvl":"DBG","ts":"2024-09-25 07:35:24.368+03:00","msg":"Peer connection upgraded","topics":"networking","peer":"16Uiu2HAmRnBjeA9DxxWRNno3kx7bFZuoXaB9QXrpf5kZS32P6LWn","connections":1}
{"lvl":"DBG","ts":"2024-09-25 07:35:25.524+03:00","msg":"Peer disconnected","topics":"networking","peer":"16Uiu2HAmRnBjeA9DxxWRNno3kx7bFZuoXaB9QXrpf5kZS32P6LWn","connections":0}
{"lvl":"WRN","ts":"2024-09-25 07:35:27.286+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":0,"new_peers":[],"current_peers":80,"wanted_peers":320}
{"lvl":"WRN","ts":"2024-09-25 07:35:55.855+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":1,"new_peers":[],"current_peers":79,"wanted_peers":320}
{"lvl":"WRN","ts":"2024-09-25 07:36:07.224+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":0,"new_peers":[],"current_peers":76,"wanted_peers":320}
{"lvl":"WRN","ts":"2024-09-25 07:36:34.411+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":5,"new_peers":[],"current_peers":78,"wanted_peers":320}
But the disconnects are higher then number of successful connections:
nimbus@macm2-01.ih-eu-mda1.nimbus.holesky:~/beacon-node-holesky-stable/logs % grep ' 07:3' service.20240925_0747.log | grep 'Lost connection to peer' | wc -l
255
nimbus@macm2-01.ih-eu-mda1.nimbus.holesky:~/beacon-node-holesky-stable/logs % grep ' 07:3' service.20240925_0747.log | grep 'Peer successfully connected' | wc -l
133
At the same time the unstable
node had absolutely no issues with connectivity on the same host:
If network issues were the problem this node would have them too, but it doesn't.
Here's the full log at that time, I can't find anything relevant: service.20240925_0747.log.gz
I have purged testing
node as stable
appears to be syncing better than before:
As we can see from this graph the purging of stable
node has not helped it sync:
This is the most bizzarre part, today suddenly around 5am the stable
node just lost a bunch of peers for no clear reason:
Kinda looks like a bug to me. But during the same time testing
node was struggling to get any reasonable amount of peers:
And unstable
was totally unaffected:
Here are logs from that time from the stable
node: service.20241009_0515.log.gz
What the fuck is happening here?
The most bizzarre part must be that incoming traffic at the same time drops to nearly zero:
While outgoing goes up matching the raise in peers:
How does any of this make sense?
Currently two nodes on
macm2-01.ih-eu-mda1.nimbus.holesky
are failing theBeaconChainNoAggregatesReceived
metrics check:And the API does confirm that: