status-im / infra-nimbus

Infrastructure for Nimbus cluster
https://nimbus.team

Fix sync of beacon nodes on macOS Holesky host #201

Open jakubgs opened 2 months ago

jakubgs commented 2 months ago

Currently two nodes on macm2-01.ih-eu-mda1.nimbus.holesky are failing the BeaconChainNoAggregatesReceived metrics check:

image
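The same signal can also be checked directly on each node's metrics endpoint; a rough sketch, where the metrics ports are a guess (check the actual --metrics-port values in the node config) and the grep is just a heuristic for the aggregate counters:

for port in 9200 9201 9202; do curl -s 0:$port/metrics | grep -ci aggregate; done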

And the API does confirm that:

jakubgs@macm2-01.ih-eu-mda1.nimbus.holesky:~ % for port in $(seq 9301 9303); do curl -s 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"979007","sync_distance":"1627664","is_syncing":true,"is_optimistic":true,"el_offline":true}}
{"data":{"head_slot":"1400927","sync_distance":"1205744","is_syncing":true,"is_optimistic":true,"el_offline":false}}
{"data":{"head_slot":"2606671","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":false}}
jakubgs commented 2 months ago

If we look at the stable node on macm2-01.ih-eu-mda1.nimbus.holesky we can see that the sync distance is just growing:

image

The testing node has a similar pattern:

image

While unstable appears to be fine:

image

jakubgs commented 2 months ago

The interesting part is that both stable:

image

And testing:

image

Are unable to get to max peers, while unstable gets there without issues:

image
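The same numbers can be pulled straight from the Beacon API as well, which is handy for a quick check without Grafana (same REST ports as the syncing check above):

for port in $(seq 9301 9303); do curl -s 0:$port/eth/v1/node/peer_count | jq -c .data; done

/eth/v1/node/peer_count is part of the standard Beacon node API and returns connected / connecting / disconnected / disconnecting counts.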

jakubgs commented 2 months ago

But LibP2P ports for all 3 nodes are correctly open on the firewall:

 > nmap -Pn -p9001-9003 macm2-01.ih-eu-mda1.nimbus.holesky.status.im
Starting Nmap 7.94 ( https://nmap.org ) at 2024-09-24 15:13 CEST
Nmap scan report for macm2-01.ih-eu-mda1.nimbus.holesky.status.im (185.181.230.76)
Host is up (0.057s latency).
rDNS record for 185.181.230.76: no-rdns.innovahosting.net

PORT     STATE SERVICE
9001/tcp open  tor-orport
9002/tcp open  dynamid
9003/tcp open  unknown

Nmap done: 1 IP address (1 host up) scanned in 0.12 seconds

So what is different about stable and testing that makes connectivity so bad?

jakubgs commented 2 months ago

What's also interesting is that their respective Geth nodes on `holesky-01.ih-eu-mda1.nimbus.geth` are in different sync states too:

jakubgs@holesky-01.ih-eu-mda1.nimbus.geth:~ % for dir in /docker/geth-*; do $dir/rpc.sh eth_syncing | jq -c; done
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":{"currentBlock":"0x0","healedBytecodeBytes":"0x0","healedBytecodes":"0x0","healedTrienodeBytes":"0x0","healedTrienodes":"0x0","healingBytecode":"0x0","healingTrienodes":"0x0","highestBlock":"0x0","startingBlock":"0x0","syncedAccountBytes":"0x0","syncedAccounts":"0x0","syncedBytecodeBytes":"0x0","syncedBytecodes":"0x0","syncedStorage":"0x0","syncedStorageBytes":"0x0","txIndexFinishedBlocks":"0x0","txIndexRemainingBlocks":"0x1"}}

The Geth node for unstable is not synced, while the ones for stable and testing appear to be synced:

jakubgs@holesky-01.ih-eu-mda1.nimbus.geth:~ % sudo du -hsc /docker/*
131G    /docker/geth-holesky-01
150G    /docker/geth-holesky-02
4.4G    /docker/geth-holesky-03
284G    total

This is very confusing.
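A quick way to double-check which Geth is actually at the chain head, assuming rpc.sh simply forwards the method name like above:

for dir in /docker/geth-*; do $dir/rpc.sh eth_blockNumber | jq -r .result; done

A synced node should return a recent hex block number rather than something near 0x0.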

jakubgs commented 2 months ago

I found some EL connection failures in the stable node logs:

{"lvl":"DBG","ts":"2024-09-24 16:30:46.299+03:00","msg":"Sending message to RPC server","topics":"JSONRPC-HTTP-CLIENT","address":{"oResultPrivate":true,"vResultPrivate":{"id":"holesky-01.ih-eu-mda1.nimbus.geth.wg:8551","scheme":"NonSecure","hostname":"holesky-01.ih-eu-mda1.nimbus.geth.wg","port":8551,"path":"","query":"","anchor":"","username":"","password":"","addresses":[{"family":"IPv4","address_v4":[10,15,0,12],"port":8551}]}},"msg_len":342,"name":"engine_forkchoiceUpdatedV1"}
{"lvl":"DBG","ts":"2024-09-24 16:30:46.300+03:00","msg":"Cancelled POST Request with JSON-RPC","topics":"JSONRPC-HTTP-CLIENT","e":"Future operation cancelled!"}

And indeed the port is closed:

jakubgs@macm2-01.ih-eu-mda1.nimbus.holesky:~ % sudo nmap -Pn -p8551-8553 holesky-01.ih-eu-mda1.nimbus.geth.wg
Starting Nmap 7.95 ( https://nmap.org ) at 2024-09-24 16:31 EEST
Nmap scan report for holesky-01.ih-eu-mda1.nimbus.geth.wg (10.15.0.12)
Host is up (0.00085s latency).

PORT     STATE  SERVICE
8551/tcp closed unknown
8552/tcp open   unknown
8553/tcp open   unknown

Nmap done: 1 IP address (1 host up) scanned in 0.03 seconds

But the one for the testing node is open, so that doesn't explain the low peer count at all.
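A slightly more direct check than nmap is hitting the engine API with curl; a reachable endpoint should answer at the HTTP level (typically 401 without a valid JWT), while a closed port shows up as 000 / connection refused:

curl -s -o /dev/null -w '%{http_code}\n' http://holesky-01.ih-eu-mda1.nimbus.geth.wg:8552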

jakubgs commented 2 months ago

The geth-holesky-01-node logs on holesky-01.ih-eu-mda1.nimbus.geth show a lot of these:

WARN [09-24|13:30:34.987] Beacon client online, but no consensus updates received in a while. Please fix your beacon client to follow the chain!

But that would make sense since that BN is not actually synced.

jakubgs commented 2 months ago

Oh, I see, the port range is wrong; it should start at 8552:

jakubgs@macm2-01.ih-eu-mda1.nimbus.holesky:~ % sudo nmap -Pn -p8552-8554 holesky-01.ih-eu-mda1.nimbus.geth.wg
Starting Nmap 7.95 ( https://nmap.org ) at 2024-09-24 16:36 EEST
Nmap scan report for holesky-01.ih-eu-mda1.nimbus.geth.wg (10.15.0.12)
Host is up (0.00079s latency).

PORT     STATE SERVICE
8552/tcp open  unknown
8553/tcp open  unknown
8554/tcp open  rtsp-alt

Nmap done: 1 IP address (1 host up) scanned in 0.02 seconds

This would also explain why the last Geth node is not synced, since it has no BN connected to it.

jakubgs commented 2 months ago

Fixed Geth ports:
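For reference, the corrected mapping looks roughly like this, assuming the nodes pair up in order (stable -> geth-holesky-01 and so on); the actual flags and variables live in the Ansible roles, and nimbus takes the endpoint via --el (or the older --web3-url):

beacon-node-holesky-stable   -> http://holesky-01.ih-eu-mda1.nimbus.geth.wg:8552
beacon-node-holesky-testing  -> http://holesky-01.ih-eu-mda1.nimbus.geth.wg:8553
beacon-node-holesky-unstable -> http://holesky-01.ih-eu-mda1.nimbus.geth.wg:8554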

jakubgs commented 2 months ago

This drop in peers of the stable node is bizarre:

image

It was doing okay and then it crashed and burned:

image

jakubgs commented 2 months ago

There's not a single error in the logs, just some warnings:

nimbus@macm2-01.ih-eu-mda1.nimbus.holesky:~/beacon-node-holesky-stable/logs % grep ' 07:3' service.20240925_0747.log | grep WRN
{"lvl":"WRN","ts":"2024-09-25 07:31:20.367+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":3,"new_peers":[],"current_peers":80,"wanted_peers":320}
{"lvl":"DBG","ts":"2024-09-25 07:35:24.368+03:00","msg":"Peer connection upgraded","topics":"networking","peer":"16Uiu2HAmRnBjeA9DxxWRNno3kx7bFZuoXaB9QXrpf5kZS32P6LWn","connections":1}
{"lvl":"DBG","ts":"2024-09-25 07:35:25.524+03:00","msg":"Peer disconnected","topics":"networking","peer":"16Uiu2HAmRnBjeA9DxxWRNno3kx7bFZuoXaB9QXrpf5kZS32P6LWn","connections":0}
{"lvl":"WRN","ts":"2024-09-25 07:35:27.286+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":0,"new_peers":[],"current_peers":80,"wanted_peers":320}
{"lvl":"WRN","ts":"2024-09-25 07:35:55.855+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":1,"new_peers":[],"current_peers":79,"wanted_peers":320}
{"lvl":"WRN","ts":"2024-09-25 07:36:07.224+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":0,"new_peers":[],"current_peers":76,"wanted_peers":320}
{"lvl":"WRN","ts":"2024-09-25 07:36:34.411+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":5,"new_peers":[],"current_peers":78,"wanted_peers":320}

But the disconnects outnumber the successful connections:

nimbus@macm2-01.ih-eu-mda1.nimbus.holesky:~/beacon-node-holesky-stable/logs % grep ' 07:3' service.20240925_0747.log | grep 'Lost connection to peer' | wc -l
     255

nimbus@macm2-01.ih-eu-mda1.nimbus.holesky:~/beacon-node-holesky-stable/logs % grep ' 07:3' service.20240925_0747.log | grep 'Peer successfully connected' | wc -l
     133
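Since every line is JSON, the churn can also be bucketed per minute to see whether the disconnects cluster (field names as in the log lines above):

grep 'Lost connection to peer' service.20240925_0747.log | jq -r '.ts[:16]' | sort | uniq -c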
jakubgs commented 2 months ago

At the same time the unstable node had absolutely no issues with connectivity on the same host:

image

If network issues were the problem, this node would have them too, but it doesn't.

jakubgs commented 2 months ago

Here's the full log from that time; I can't find anything relevant: service.20240925_0747.log.gz

jakubgs commented 2 months ago

I have purged the testing node, as stable appears to be syncing better than before:

image
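For the record, the purge is just wiping the node's database and letting it sync from scratch; roughly something like this, where the data dir layout is an assumption based on the paths above and the service has to be stopped and started around it:

# stop the node's service first (how it is managed depends on the host setup)
rm -rf ~/beacon-node-holesky-testing/data/db
# then start it again and let it resync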

jakubgs commented 1 month ago

As we can see from this graph, purging the stable node has not helped it sync:

Image

jakubgs commented 1 month ago

This is the most bizarre part: today, suddenly, around 5am the stable node just lost a bunch of peers for no clear reason:

Image

Kinda looks like a bug to me. But during the same time the testing node was struggling to get any reasonable number of peers:

Image

And unstable was totally unaffected:

Image

https://grafana.infra.status.im/d/pgeNfj2Wz23/nimbus-beacon-node?orgId=1&var-instance=macm2-01.ih-eu-mda1.nimbus.holesky&var-container=beacon-node-holesky-stable&from=1728425848167&to=1728469048167&query=fluf

jakubgs commented 1 month ago

Here are logs from that time from the stable node: service.20241009_0515.log.gz

jakubgs commented 1 month ago

What the fuck is happening here?

Image

jakubgs commented 1 month ago

The most bizarre part must be that incoming traffic at the same time drops to nearly zero:

Image

While outgoing goes up, matching the rise in peers:

Image

How does any of this make sense?
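One way to sanity-check those graphs on the host itself is per-process traffic with nettop on macOS (the grep pattern assumes the beacon node binary name contains 'nimbus'):

nettop -P -L 1 | grep -i nimbus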