Here's the current state as far as I can tell:
The issue seems to be in https://github.com/status-im/nimbus-eth2/commit/8b3ffec0.
I've tested https://github.com/status-im/nimbus-eth2/pull/4840 and it appears to fix the issue:
admin@linux-01.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9304 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"6257820","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6257820","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
admin@linux-01.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9304 9305); do c 0:$port/eth/v1/node/version | jq -c; done
{"data":{"version":"Nimbus/v23.3.2-75be7d-stateofus"}}
{"data":{"version":"Nimbus/v23.3.2-751d9d-stateofus"}}
For some reason linux-05 geth synced in record time of under 24 hours:
admin@linux-05.ih-eu-mda1.nimbus.mainnet:~ % /docker/geth-mainnet/rpc.sh eth_syncing
{
"jsonrpc": "2.0",
"id": 1,
"result": false
}
admin@linux-05.ih-eu-mda1.nimbus.mainnet:~ % sudo du -hs /docker/geth-mainnet
804G /docker/geth-mainnet
What I don't get is why some graphs on nodes look like this:
beacon-node-mainnet-unstable-01@linux-05.ih-eu-mda1.nimbus.mainnet
beacon-node-mainnet-unstable-02@linux-05.ih-eu-mda1.nimbus.mainnet
While the other unstable node on the same host looks like this. Why is there such a big difference?
Two geth nodes are still not synced:
It's linux-02 and linux-04 that need some more time.
I made a bad decision to sync linux-01 from metal-01, and it's still going:
All nodes except linux-02 and linux-04 are fully synced:
For some reason 02 and 04 got stuck at ~50-60 block distance. Let's try a restart.
Logs are full of Ignoring payload while snap syncing on both hosts:
admin@linux-02.ih-eu-mda1.nimbus.mainnet:~ % tail -n3 /var/log/docker/geth-mainnet-node/docker.log
WARN [04-24|10:52:57.973] Ignoring payload while snap syncing number=17,115,671 hash=85673c..50867e
WARN [04-24|10:52:57.983] Ignoring payload while snap syncing number=17,115,672 hash=4aa076..81a036
WARN [04-24|10:52:58.013] Ignoring payload while snap syncing number=17,115,673 hash=788615..89a986
Seems like the issue is in trienodes:
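One rough way to confirm whether the state-heal phase is advancing is to watch Geth's own progress lines; the exact wording below is a guess and may differ between releases:
# Tail recent snap-sync / state-heal progress messages (log phrasing assumed).
grep -E 'State (sync|heal) in progress' /var/log/docker/geth-mainnet-node/docker.log | tail -n 5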
Nodes on linux-02 and linux-04 refuse to sync fully:
admin@linux-02.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9300 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"3608450","sync_distance":"2684867","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6293056","sync_distance":"261","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6293120","sync_distance":"197","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6293099","sync_distance":"218","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6293056","sync_distance":"261","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"6293086","sync_distance":"231","is_syncing":true,"is_optimistic":true}}
admin@linux-04.ih-eu-mda1.nimbus.mainnet:/data % for port in $(seq 9300 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"3564995","sync_distance":"2728326","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6293090","sync_distance":"231","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6293056","sync_distance":"265","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"6293096","sync_distance":"225","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6293056","sync_distance":"265","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6293087","sync_distance":"234","is_syncing":true,"is_optimistic":true}}
Which is bizarre, because the ERA files are in place and the trusted node for checkpoint sync is available.
Here we can see a comparison of Web3 requests to the execution layer node on some hosts:
unstable-02@linux-02.ih-eu-mda1.nimbus.mainnet
unstable-02@linux-01.ih-eu-mda1.nimbus.mainnet
The latter is synced, the former is not.
Seems to be making only a few types of requests right now: getBlockByNumber and exchangeTransitionConfiguration.
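Those map to plain JSON-RPC calls; for example, the block poll is roughly equivalent to the following (endpoint and port assumed here, the beacon node uses whatever web3 URL it was configured with):
# Hypothetical manual version of the block polling request.
curl -s -X POST -H 'Content-Type: application/json' \
  --data '{"jsonrpc":"2.0","id":1,"method":"eth_getBlockByNumber","params":["latest",false]}' \
  http://localhost:8545 | jq -c '.result.number'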
The linux-01.ih-eu-mda1.nimbus.mainnet host is nearly fully synced:
admin@linux-01.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9300 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"4350302","sync_distance":"1956324","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6306626","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6306626","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6306626","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6306626","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6306626","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
Aside from stable-01, which is syncing from scratch as intended.
One thing that worries me though is that the average load is slightly above 10.0, and we have 10 cores:
admin@linux-01.ih-eu-mda1.nimbus.mainnet:~ % uptime
10:06:03 up 7 days, 21:05, 1 user, load average: 10.15, 10.99, 10.82
A noticeable portion of CPU time is being used by journald and rsyslog due to the high volume of logs.
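A quick way to see where the CPU time actually goes:
# Top CPU consumers at a glance; no extra packages needed.
ps -eo pid,comm,%cpu --sort=-%cpu | head -n 10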
I wonder if this high CPU saturation is an issue, or if it's fine for now. We do have an option to add a 2nd CPU if we want.
What do you think @Menduist @zah? If you think this host looks fine as is I will start releasing some of our Hetzner ones.
Here are the Grafana dashboards for the linux-01.ih-eu-mda1.nimbus.mainnet host:
https://grafana.infra.status.im/d/QCTZ8-Vmk/single-host-dashboard?orgId=1&refresh=1m&var-host=linux-01.ih-eu-mda1.nimbus.mainnet
https://metrics.status.im/d/pgeNfj2Wz23/nimbus-fleet-testnets?orgId=1&refresh=5m&var-instance=linux-01.ih-eu-mda1.nimbus.mainnet&var-container=beacon-node-mainnet-stable-02&from=now-24h&to=now
Another option would be dropping one of the nodes to lessen CPU pressure. We could just get rid of stable-02.
Most nodes on linux-02 and linux-04 are still syncing backwards:
admin@linux-04.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9300 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"4815694","sync_distance":"1500347","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6307691","sync_distance":"8350","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6307918","sync_distance":"8123","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6308232","sync_distance":"7809","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6315723","sync_distance":"318","is_syncing":true,"is_optimistic":true,"el_offline":false}}
{"data":{"head_slot":"6312724","sync_distance":"3317","is_syncing":true,"is_optimistic":true,"el_offline":false}}
No help in sight.
I'm going to decommission 6 out of 7 Hetzner hosts since they are mostly unusable now:
Changes to Outputs:
~ hosts = {
~ "metal-01.he-eu-hel1.nimbus.mainnet" = "95.217.87.121" -> "65.109.80.106"
- "metal-02.he-eu-hel1.nimbus.mainnet" = "135.181.0.33"
- "metal-03.he-eu-hel1.nimbus.mainnet" = "135.181.60.170"
- "metal-04.he-eu-hel1.nimbus.mainnet" = "65.21.193.229"
- "metal-05.he-eu-hel1.nimbus.mainnet" = "135.181.60.177"
- "metal-06.he-eu-hel1.nimbus.mainnet" = "135.181.56.50"
- "metal-07.he-eu-hel1.nimbus.mainnet" = "65.109.80.106"
# (26 unchanged attributes hidden)
}
I will reuse two hosts for Nimbus GitHub CI runners, the rest will be cancelled.
I'm going to keep old metal-03 and metal-05 since they have very low usage of their SSDs:
metal-01.he-eu-hel1.nimbus.mainnet - 95.217.87.121
  Power On Hours: 29,549
  Power On Hours: 29,549
metal-02.he-eu-hel1.nimbus.mainnet - 135.181.0.33
  Power On Hours: 4,338
  Power On Hours: 5,562
metal-03.he-eu-hel1.nimbus.mainnet - 135.181.60.170
  Power On Hours: 3,716
  Power On Hours: 3,716
metal-04.he-eu-hel1.nimbus.mainnet - 65.21.193.229
  Power On Hours: 8,089
  Power On Hours: 9,465
metal-05.he-eu-hel1.nimbus.mainnet - 135.181.60.177
  Power On Hours: 712
  Power On Hours: 724
metal-06.he-eu-hel1.nimbus.mainnet - 135.181.56.50
  Power On Hours: 3,605
  Power On Hours: 3,605
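For reference, those power-on hours come from SMART data and can be re-checked with something like this (device paths are just examples):
# Read power-on hours from each data disk; adjust device names per host.
for dev in /dev/sda /dev/sdb; do
  sudo smartctl -a "$dev" | grep -iE 'Power[ _]On[ _]Hours'
done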
I inquired about our servers with Innova and this is what they said:
We still waiting for required CPU for servers to be able to activate more of them, as you want exact CPU model, which is depending on the supply company.
As for time, when new servers will be activated, we will set next due date a year from activation date, so you didn't loose any day, due to delayed activation.
Currently, we are out of stock of Intel E5-2690 v2, so we will look if we can get 20 Intel E5-2667 v3 or v4.
So I said we could get away with using E5-2690 v2 for Mainnet and then E5-2667 v4 for Prater:
I think we could be fine with E5-2667 v4 or v3 (but consistently one or the other) for the rest of the hosts as long as we get just one more E5-2690 v2 host.
The reason for this is that we want to have 7 Mainnet hosts, and currently we have 5, and 1 is used for Prater. I could repurpose the Prater host for Mainnet, which would leave us at 6 out of 7. If we could get one more with E5-2690 v2 that would give us a full Mainnet fleet with the same CPUs.
Then the E5-2667 v4 could go to our Prater testnet hosts.
Which at least should give us consistent performance across a whole network.
I've also adjusted the layout of nodes on the new Mainnet hosts to use the --no-el flag for all even-numbered nodes:
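Roughly, the even-numbered nodes now launch without an EL connection; an illustrative (not actual) launch snippet, with the binary path and the flags other than --no-el assumed:
# Illustrative only: an even-numbered beacon node started without an EL connection.
/usr/local/bin/nimbus_beacon_node \
  --network=mainnet \
  --data-dir=/data/beacon-node-mainnet-unstable-02 \
  --no-el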
Because the latency graphs for the execution layer endpoint are not pretty:
Which suggests we are abusing it too much with 6 beacon nodes connected to the same Geth node.
It definitely makes a difference:
Well, the result of adding the --no-el flag to the 02 nodes is clear:
admin@linux-02.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9300 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"6372901","sync_distance":"19942","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6392843","sync_distance":"0","is_syncing":false,"is_optimistic":true}}
{"data":{"head_slot":"6365190","sync_distance":"27653","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6392843","sync_distance":"0","is_syncing":false,"is_optimistic":true}}
{"data":{"head_slot":"6387811","sync_distance":"5032","is_syncing":true,"is_optimistic":true,"el_offline":false}}
{"data":{"head_slot":"6392843","sync_distance":"0","is_syncing":false,"is_optimistic":true,"el_offline":true}}
The ones that have it sync fine; the rest fail to sync.
The latency on EL node responses is very good:
So I don't think that's the issue.
I've received a 7th host from InnovaHosting with the Xeon E5-2690 CPU. With the Prater host we will have a full Mainnet fleet.
I'm going to repurpose linux-01.ih-eu-mda1.nimbus.prater as linux-07.ih-eu-mda1.nimbus.mainnet so the whole Mainnet fleet uses E5-2690 CPUs for consistent performance.
Changes:
Result:
> a ih-eu-mda1 -o -a 'lscpu | grep "Model name"' | sort
linux-01.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 | (stdout) Model name: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
linux-02.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 | (stdout) Model name: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
linux-03.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 | (stdout) Model name: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
linux-04.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 | (stdout) Model name: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
linux-05.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 | (stdout) Model name: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
linux-06.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 | (stdout) Model name: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
linux-07.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 | (stdout) Model name: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
The nodes are syncing, but I'll rsync ERA files to them manually:
admin@linux-06.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9300 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"315215","sync_distance":"6087220","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"313739","sync_distance":"6088696","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"265347","sync_distance":"6137088","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"226308","sync_distance":"6176127","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"218076","sync_distance":"6184359","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"175205","sync_distance":"6227230","is_syncing":true,"is_optimistic":false,"el_offline":true}}
admin@linux-07.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9300 9304); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"326049","sync_distance":"6076388","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"128110","sync_distance":"6274327","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"103949","sync_distance":"6298488","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"71498","sync_distance":"6330939","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"41378","sync_distance":"6361059","is_syncing":true,"is_optimistic":false}}
I have decommissioned the last Mainnet Hetzner host:
And cancelled the host subscription:
Looks like the remaining servers should show up next week:
We got the first 3 hosts, which I will use for Sepolia and Prater:
Bootstrapped and configured the hosts:
The nodes are syncing:
admin@linux-01.ih-eu-mda1.nimbus.sepolia:~ % for port in $(seq 9311 9314); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"201883","sync_distance":"2295315","is_syncing":true,"is_optimistic":true,"el_offline":false}}
{"data":{"head_slot":"159219","sync_distance":"2337979","is_syncing":true,"is_optimistic":true,"el_offline":false}}
{"data":{"head_slot":"84759","sync_distance":"2412439","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"3540","sync_distance":"2493658","is_syncing":true,"is_optimistic":false,"el_offline":fa
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"20411","sync_distance":"5745667","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"14122","sync_distance":"5751956","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"8505","sync_distance":"5757573","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"2303","sync_distance":"5763775","is_syncing":true,"is_optimistic":false,"el_offline":fals
admin@linux-02.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"33307","sync_distance":"5732917","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"9065","sync_distance":"5757159","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"3551","sync_distance":"5762673","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"11717","sync_distance":"5754507","is_syncing":true,"is_optimistic":false,"el_offline":false}}
It appears the Sepolia host has synced fully without issues:
admin@linux-01.ih-eu-mda1.nimbus.sepolia:~ % for dir in /docker/geth-*; do $dir/rpc.sh eth_syncing | jq -c; done
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
admin@linux-01.ih-eu-mda1.nimbus.sepolia:~ % for port in $(seq 9311 9314); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"2518357","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"2518357","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"2518357","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"2518357","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":false}}
admin@linux-01.ih-eu-mda1.nimbus.sepolia:~ % for dir in /docker/geth-*; do $dir/rpc.sh eth_syncing | jq -c; done
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
admin@linux-01.ih-eu-mda1.nimbus.sepolia:~ % sudo du -hsc /data/* /docker/*
17G /data/beacon-node-sepolia-unstable-01
17G /data/beacon-node-sepolia-unstable-02
17G /data/beacon-node-sepolia-unstable-03
17G /data/beacon-node-sepolia-unstable-04
17G /data/beacon-node-sepolia-unstable-trial-01
16K /data/lost+found
8.1G /data/nimbus-eth1-sepolia-master-trial
2.6G /data/validator-client-sepolia-unstable-01
2.6G /data/validator-client-sepolia-unstable-02
2.6G /data/validator-client-sepolia-unstable-03
2.6G /data/validator-client-sepolia-unstable-04
29G /docker/geth-sepolia-01
29G /docker/geth-sepolia-02
29G /docker/geth-sepolia-03
29G /docker/geth-sepolia-04
4.2M /docker/log
16K /docker/lost+found
213G total
Time to deploy validators to the new host.
I have removed the validators from old Sepolia host and deployed them to the new host:
The validators missed about 6 epochs each:
Prater nodes are still syncing:
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"2025440","sync_distance":"3763942","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"2005777","sync_distance":"3783605","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"2019396","sync_distance":"3769986","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"1574778","sync_distance":"4214604","is_syncing":true,"is_optimistic":false,"el_offline":false}}
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % sudo du -hsc /data/*
20G /data/beacon-node-prater-libp2p
22G /data/beacon-node-prater-stable
22G /data/beacon-node-prater-testing
37G /data/beacon-node-prater-unstable
4.0K /data/era
16K /data/lost+found
98G total
admin@linux-02.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"2056079","sync_distance":"3733306","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"2015206","sync_distance":"3774179","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"2046075","sync_distance":"3743310","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"1600162","sync_distance":"4189223","is_syncing":true,"is_optimistic":false,"el_offline":false}}
admin@linux-02.ih-eu-mda1.nimbus.prater:~ % sudo du -hsc /data/*
20G /data/beacon-node-prater-libp2p
22G /data/beacon-node-prater-stable
36G /data/beacon-node-prater-testing
22G /data/beacon-node-prater-unstable
4.0K /data/era
16K /data/lost+found
98G total
Still syncing:
But Geth nodes are less than 1 mil away from syncing:
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for dir in /docker/geth-*; do $dir/rpc.sh eth_syncing | jq -c '.result | { currentBlock, highestBlock }'; done
{"currentBlock":"0x850633","highestBlock":"0x850674"}
{"currentBlock":"0x84ecbf","highestBlock":"0x84ed00"}
{"currentBlock":"0x85324e","highestBlock":"0x85378a"}
{"currentBlock":"0x7985d9","highestBlock":"0x79861a"}
admin@linux-02.ih-eu-mda1.nimbus.prater:~ % for dir in /docker/geth-*; do $dir/rpc.sh eth_syncing | jq -c '.result | { currentBlock, highestBlock }'; done
{"currentBlock":"0x858bb7","highestBlock":"0x858bf8"}
{"currentBlock":"0x857f80","highestBlock":"0x857fc1"}
{"currentBlock":"0x858dd7","highestBlock":"0x8595ce"}
{"currentBlock":"0x7a0b01","highestBlock":"0x7a0b42"}
So we should be able to switch the public API endpoints to these hosts within a day or two.
Looks like we are off by about 100k:
So this should finish soon, and then the trienode syncing will start, so it might possibly be done tomorrow.
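When the trienode (state heal) phase starts, newer Geth versions expose extra progress counters in eth_syncing; something like this should surface them if the fields exist (field names may vary by release):
# Extended snap-sync progress; `objects` skips nodes that already report `false`.
for dir in /docker/geth-*; do
  $dir/rpc.sh eth_syncing | jq -c '.result | objects | {currentBlock, highestBlock, healedTrienodes, healingTrienodes}'
done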
Looks like most nodes finished syncing, except 3:
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"5853044","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"5853044","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"5853044","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"5106185","sync_distance":"746859","is_syncing":true,"is_optimistic":true,"el_offline":false}}
admin@linux-02.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"5853072","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"5853072","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"5853072","sync_distance":"0","is_syncing":false,"is_optimistic":true,"el_offline":false}}
{"data":{"head_slot":"5142760","sync_distance":"710312","is_syncing":true,"is_optimistic":true,"el_offline":false}}
It appears sometimes Geth nodes get to the Trienodes sync stage and then just stop:
No progress whatsoever. I guess it's time for a restart.
In hindsight I should have upgraded to Geth 1.12.0 so we could start syncing with the new Pebble database instead of LevelDB:
https://github.com/ethereum/go-ethereum/releases/tag/v1.12.0
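For any future resync the backend can be picked explicitly; recent Geth releases accept a db engine flag (it only applies to a freshly created database, and the datadir below is assumed):
# Start a fresh sync on Pebble instead of LevelDB.
geth --datadir /docker/geth-goerli-01/node --db.engine pebble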
It appears only the 04 Geth nodes on the new Prater hosts are not fully synced, but they're almost there:
The rest are done:
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"5866576","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"5866576","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"5866576","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"5721632","sync_distance":"144944","is_syncing":true,"is_optimistic":true,"el_offline":fals
admin@linux-02.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"5866576","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"5866576","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"5866576","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"5751773","sync_distance":"114803","is_syncing":true,"is_optimistic":true,"el_offline":false}}
The remaining nodes are only libp2p, so maybe it's time to switch API and validators to the new hosts.
I have moved the validators from the old linux-02 to the new linux-02 on the InnovaHosting server:
Effect:
admin@linux-02.ih-eu-mda1.nimbus.prater:~ % sudo find /data/beacon-node-prater-{stable,testing,unstable,libp2p}/data/secrets/ -type f
/data/beacon-node-prater-stable/data/secrets/0x94b906d2efe55dbf622d2790fb5ac11dead1b90414ee8728c3912f189f96ae29b1b784047418acaa052a46cedd6821e4
/data/beacon-node-prater-testing/data/secrets/0x94b98aba01a83401cad0c8929a3bba5ec78393a73cff54689c9114d719c61fd74c44b4750e6aaeb14c820e77feb8e419
/data/beacon-node-prater-unstable/data/secrets/0x94bcce71396877f3c16d9aa6dadcdef060e1a248ea063cc68892bfa6969ed5af3eb8bc6b0d66476ef37e0468345da8c0
/data/beacon-node-prater-libp2p/data/secrets/0x94bdf6db0d7d429da5ec1eb198543bd87281f7e3eb21600ec0fe2140cf3516cbf28d0bf03bfb35ed19a2ac2dd7988cb7
The libp2p node should be up shortly.
And the API endpoints are up as well:
We have received another 4 hosts from Innova:
Will bootstrap them today.
I have bootstrapped the 4 new hosts:
And the nodes are syncing:
admin@linux-03.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"218401","sync_distance":"5698081","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"211451","sync_distance":"5705031","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"203154","sync_distance":"5713328","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"144955","sync_distance":"5771527","is_syncing":true,"is_optimistic":false,"el_offline":false}}
admin@linux-04.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"191968","sync_distance":"5724515","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"212027","sync_distance":"5704456","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"185232","sync_distance":"5731251","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"144573","sync_distance":"5771910","is_syncing":true,"is_optimistic":false,"el_offline":false}}
admin@linux-05.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"178278","sync_distance":"5738206","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"172308","sync_distance":"5744176","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"133745","sync_distance":"5782739","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"121869","sync_distance":"5794615","is_syncing":true,"is_optimistic":false,"el_offline":false}}
admin@linux-06.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"230025","sync_distance":"5686814","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"166794","sync_distance":"5750045","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"20428","sync_distance":"5896411","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"12032","sync_distance":"5904807","is_syncing":true,"is_optimistic":false,"el_offline":false}}
We have received another 6 servers from InnovaHosting.
After reviewing our storage needs on the Prater hosts I've decided to ask them to move 1 NVMe from each of the new servers to our nimbus.prater fleet hosts, to fix the issue of not enough storage for the Geth nodes:
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % df -h /docker
Filesystem Size Used Avail Use% Mounted on
/dev/sdc 1.5T 1.1T 315G 78% /docker
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % sudo du -hsc /docker/*
268G /docker/geth-goerli-01
269G /docker/geth-goerli-02
270G /docker/geth-goerli-03
271G /docker/geth-goerli-04
5.1M /docker/log
16K /docker/lost+found
1.1T total
I've created a ticket:
Hello,
Thanks for delivering another 6 servers.
After rethinking our setup and our storage needs I was wondering if it would be possible to remove just one 1.6 TB NVMe from each of the new server9724 to server9729 servers and add them to the following ones:
server9717 - 185.181.230.78 - Real name: linux-01.ih-eu-mda1.nimbus.prater
server9718 - 185.181.230.79 - Real name: linux-02.ih-eu-mda1.nimbus.prater
server9721 - 185.181.230.121 - Real name: linux-03.ih-eu-mda1.nimbus.prater
server9720 - 194.33.40.231 - Real name: linux-04.ih-eu-mda1.nimbus.prater
server9722 - 194.33.40.232 - Real name: linux-05.ih-eu-mda1.nimbus.prater
server9723 - 194.33.40.233 - Real name: linux-06.ih-eu-mda1.nimbus.prater
Maybe you could also update the names while you're at it.
https://client.innovahosting.net/viewticket.php?tid=527485&c=B8CyWeCt
They have moved the NVMes and I have received this response:
I have moved disks from server to server as you requested.
But to see them on server, you have to create logical drive on them via ssacli https://gist.github.com/mrpeardotnet/a9ce41da99936c0175600f484fa20d03
Also it is good to delete logical drive which are not existing anymore from servers from where we have extracted drives.
Be careful when you are using ssacli.
https://client.innovahosting.net/viewticket.php?tid=527485&c=B8CyWeCt
And indeed, after installing ssacli I can see the extra drive as Unassigned:
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % sudo ssacli ctrl slot=0 pd all show
Smart Array P840ar in Slot 0 (Embedded)
Array A
physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SAS SSD, 400 GB, OK)
Array B
physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SAS SSD, 1.6 TB, OK)
Array C
physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SAS SSD, 1.6 TB, OK)
Unassigned
physicaldrive 2I:0:6 (port 2I:box 0:bay 6, SAS SSD, 1.6 TB, OK)
So according to the doc I have to run:
ssacli ctrl slot=0 create type=ld drives=2I:0:6 raid=0
Which should create a new RAID 0 logical drive on controller slot 0, using the disk at port 2I, box 0, bay 6.
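Afterwards the new logical drive should show up when listing all logical drives on the controller:
# Confirm the new RAID 0 volume was created.
sudo ssacli ctrl slot=0 ld all show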
If I look at the details of Array B we can see that it has Fault Tolerance: 0:
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % sudo ssacli ctrl slot=0 ld 2 show detail
Smart Array P840ar in Slot 0 (Embedded)
Array B
Logical Drive: 2
Size: 1.46 TB
Fault Tolerance: 0
Heads: 255
Sectors Per Track: 32
Cylinders: 65535
Strip Size: 256 KB
Full Stripe Size: 256 KB
Status: OK
MultiDomain Status: OK
Caching: Disabled
Unique Identifier: 600508B1001C8754D0E5B226B43349AF
Disk Name: /dev/sdb
Mount Points: /mnt/sdb,/data 1.5 TB Partition Number 0
Drive Type: Data
LD Acceleration Method: Smart Path
Which I assume means it's RAID 0.
One thing I could do is combine the two 1.6 TB drives into one RAID 0 volume of 3.2 TB for the Geth nodes. That way I could avoid having to split things up into two folders, /docker1 and /docker2. The disadvantage is that it doubles the chance of the array failing, since losing either one of the two NVMes kills it. Although considering this is just a testnet we could live with that.
Yeah, so I can delete the array and now I have two unassigned drives:
admin@linux-06.ih-eu-mda1.nimbus.prater:/ % sudo ssacli ctrl slot=0 ld 3 delete
Warning: Deleting an array can cause other array letters to become renamed.
E.g. Deleting array A from arrays A,B,C will result in two remaining
arrays A,B ... not B,C
Warning: Deleting the specified device(s) will result in data being lost.
Continue? (y/n) y
admin@linux-06.ih-eu-mda1.nimbus.prater:/ % sudo ssacli ctrl slot=0 pd all show
Smart Array P440ar in Slot 0 (Embedded)
Array A
physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS SSD, 400 GB, OK)
Array B
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS SSD, 1.6 TB, OK)
Unassigned
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS SSD, 1.6 TB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS SSD, 1.6 TB, OK)
And I can get a RAID 0 out of them.
And now that looks sensible:
admin@linux-06.ih-eu-mda1.nimbus.prater:/ % sudo ssacli ctrl slot=0 create type=ld drives=1I:1:2,1I:1:3 raid=0
Warning: SSD Over Provisioning Optimization will be performed on the physical
drives in this array. This process may take a long time and cause this
application to appear unresponsive.
admin@linux-06.ih-eu-mda1.nimbus.prater:/ % sudo ssacli ctrl slot=0 pd all show
Smart Array P440ar in Slot 0 (Embedded)
Array A
physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS SSD, 400 GB, OK)
Array B
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS SSD, 1.6 TB, OK)
Array C
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS SSD, 1.6 TB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS SSD, 1.6 TB, OK)
And it is now visible via fdisk:
admin@linux-06.ih-eu-mda1.nimbus.prater:/ % sudo fdisk -l /dev/sdc
Disk /dev/sdc: 2.91 TiB, 3200575168512 bytes, 6251123376 sectors
Disk model: LOGICAL VOLUME
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 262144 bytes / 524288 byte
I cleaned up the partition table just in case:
admin@linux-06.ih-eu-mda1.nimbus.prater:/ % sudo wipefs -a /dev/sdc
/dev/sdc: 2 bytes were erased at offset 0x00000438 (ext4): 53 ef
And after running:
> ap ansible/bootstrap.yml -t role::bootstrap:volumes -l linux-06.ih-eu-mda1.nimbus.prater
I have it mounted properly:
admin@linux-06.ih-eu-mda1.nimbus.prater:/ % df -h /docker
Filesystem Size Used Avail Use% Mounted on
/dev/sdc 2.9T 28K 2.8T 1% /docker
Neat.
I've also bootstrapped 3 nimbus.geth hosts:
Since Hetzner hates crypto mining and states so in their policy:
https://www.hetzner.com/legal/dedicated-server/
We will need to migrate our hosts to an alternative server provider soon. To do that we need to find a suitable alternative.
The general requirements involve:
We'll need about 21 hosts with similar specs.