status-im / infra-nimbus

Infrastructure for Nimbus cluster
https://nimbus.team

Research alternatives to Hetzner for testnet hosting #132

Closed jakubgs closed 1 year ago

jakubgs commented 2 years ago

Since Hetzner hates crypto mining and states so in their policy:

Therefore the following actions are prohibited:

  • Operating applications that are used to mine crypto currencies

https://www.hetzner.com/legal/dedicated-server/

We will need to migrate our hosts to an alternative server provider soon. To do that we need to find a suitable alternative.

The general requirements involve:

We'll need about 21 hosts with similar specs.

tersec commented 1 year ago

Would guess https://github.com/status-im/nimbus-eth2/commit/75be7d267d04ed2e5cb46553cb21ec83ee052a09 or https://github.com/status-im/nimbus-eth2/commit/8b3ffec0d530f3bfaa83198de098cb199f81b921

jakubgs commented 1 year ago

Here's the current state as far as I can tell:

| Date | Host | Node | Commit | Result |
|------|------|------|--------|--------|
| 2023-04-11 | linux-02 | unstable-01 | https://github.com/status-im/nimbus-eth2/commit/c3d043c0 | :heavy_check_mark: |
| 2023-04-16 | linux-01 | unstable-02 | https://github.com/status-im/nimbus-eth2/commit/57623af3 | :heavy_check_mark: |
| 2023-04-17 | linux-03 | unstable-01 | https://github.com/status-im/nimbus-eth2/commit/4df851f4 | :heavy_check_mark: |
| 2023-04-11 | linux-02 | unstable-01 | https://github.com/status-im/nimbus-eth2/commit/75be7d26 | :heavy_check_mark: |
| 2023-04-17 | linux-03 | unstable-01 | https://github.com/status-im/nimbus-eth2/commit/176c80a3 | :heavy_check_mark: |
| 2023-04-09 | linux-01 | unstable-02 | https://github.com/status-im/nimbus-eth2/commit/7df75d77 | :heavy_check_mark: |
| 2023-03-22 | linux-04 | unstable-01 | https://github.com/status-im/nimbus-eth2/commit/8b3ffec0 | :x: |
| 2023-03-30 | linux-04 | unstable-02 | https://github.com/status-im/nimbus-eth2/commit/228e10f1 | :x: |
| 2023-04-17 | linux-02 | unstable-02 | https://github.com/status-im/nimbus-eth2/commit/b5115215 | :x: |
| 2023-04-18 | linux-03 | unstable-02 | https://github.com/status-im/nimbus-eth2/commit/1d3e8382 | :x: |

The issue seems to be in https://github.com/status-im/nimbus-eth2/commit/8b3ffec0.

tersec commented 1 year ago

https://github.com/status-im/nimbus-eth2/pull/4840

jakubgs commented 1 year ago

I've tested https://github.com/status-im/nimbus-eth2/pull/4840 and it appears to fix the issue:

admin@linux-01.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9304 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"6257820","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6257820","sync_distance":"0","is_syncing":false,"is_optimistic":false}}

admin@linux-01.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9304 9305); do c 0:$port/eth/v1/node/version | jq -c; done
{"data":{"version":"Nimbus/v23.3.2-75be7d-stateofus"}}
{"data":{"version":"Nimbus/v23.3.2-751d9d-stateofus"}}
jakubgs commented 1 year ago

For some reason the linux-05 Geth node synced in a record time of under 24 hours:

image

admin@linux-05.ih-eu-mda1.nimbus.mainnet:~ % /docker/geth-mainnet/rpc.sh eth_syncing
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": false
}

admin@linux-05.ih-eu-mda1.nimbus.mainnet:~ % sudo du -hs /docker/geth-mainnet
804G    /docker/geth-mainnet
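
The `rpc.sh` script is a repo-local helper; assuming it simply POSTs a JSON-RPC request to the Geth node's HTTP endpoint, an equivalent check would look roughly like this (the 8545 port is an assumption, not taken from this repo):

```bash
# Hypothetical stand-in for /docker/geth-mainnet/rpc.sh eth_syncing:
# a plain JSON-RPC call; "result": false means Geth considers itself synced.
curl -s -X POST -H 'Content-Type: application/json' \
  --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
  http://localhost:8545 | jq
```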
jakubgs commented 1 year ago

What I don't get is why some graphs on nodes look like this:

beacon-node-mainnet-unstable-01@linux-05.ih-eu-mda1.nimbus.mainnet

image

beacon-node-mainnet-unstable-02@linux-05.ih-eu-mda1.nimbus.mainnet

image


While the other unstable node on the same host looks like this. Why is there such a big difference?

jakubgs commented 1 year ago

Two geth nodes are still not synced:

image

It's linux-02 and linux-04 that need some more time.

I made a bad decision to sync linux-01 from metal-01, and it's still going:

image

jakubgs commented 1 year ago

All nodes except linux-02 and linux-04 are fully synced:

image

For some reason 02 and 04 got stuck at a ~50-60 block distance. Let's try a restart.

jakubgs commented 1 year ago

Logs are full of `Ignoring payload while snap syncing` warnings on both hosts:

admin@linux-02.ih-eu-mda1.nimbus.mainnet:~ % tail -n3 /var/log/docker/geth-mainnet-node/docker.log
WARN [04-24|10:52:57.973] Ignoring payload while snap syncing      number=17,115,671 hash=85673c..50867e
WARN [04-24|10:52:57.983] Ignoring payload while snap syncing      number=17,115,672 hash=4aa076..81a036
WARN [04-24|10:52:58.013] Ignoring payload while snap syncing      number=17,115,673 hash=788615..89a986

Seems like the issue is in trienodes:

image
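
A crude way to tell whether snap sync is making headway or just spinning is to count those warnings per minute in the same docker log (a generic sketch, nothing repo-specific):

```bash
# Count "Ignoring payload while snap syncing" warnings per minute in the log;
# the warnings keep coming as long as snap sync is in progress, so this mostly
# tells you when they finally stop.
grep 'Ignoring payload while snap syncing' /var/log/docker/geth-mainnet-node/docker.log \
  | awk '{print $2}' | cut -d: -f1-2 | uniq -c | tail
```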

jakubgs commented 1 year ago

Nodes on linux-02 and linux-04 refuse to sync fully:

admin@linux-02.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9300 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"3608450","sync_distance":"2684867","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6293056","sync_distance":"261","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6293120","sync_distance":"197","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6293099","sync_distance":"218","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6293056","sync_distance":"261","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"6293086","sync_distance":"231","is_syncing":true,"is_optimistic":true}}
admin@linux-04.ih-eu-mda1.nimbus.mainnet:/data % for port in $(seq 9300 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"3564995","sync_distance":"2728326","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6293090","sync_distance":"231","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6293056","sync_distance":"265","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"6293096","sync_distance":"225","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6293056","sync_distance":"265","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6293087","sync_distance":"234","is_syncing":true,"is_optimistic":true}}

Which is bizarre, because the ERA files are in place and the trusted node for checkpoint sync is available.
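
For context, this is what the checkpoint sync path looks like in nimbus-eth2; a rough sketch of re-syncing one of these nodes from the trusted node would be something like the following (the URL and data dir are placeholders, and the exact flags should be checked against the nimbus-eth2 version we actually deploy):

```bash
# Hypothetical trusted node sync: pull the head state from an already-synced
# beacon node's REST API; --backfill=false defers downloading historical blocks,
# since those can be served from the ERA files instead.
./build/nimbus_beacon_node trustedNodeSync \
  --network:mainnet \
  --data-dir=/data/beacon-node-mainnet-unstable-02 \
  --trusted-node-url=http://trusted-node.example.org:9300 \
  --backfill=false
```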

jakubgs commented 1 year ago

Here we can see a comparison of Web3 requests to the execution layer node on some hosts:

unstable-02@linux-02.ih-eu-mda1.nimbus.mainnet

image

unstable-02@linux-01.ih-eu-mda1.nimbus.mainnet

image


The latter is synced, the former is not. The unsynced one seems to be making only a couple of request types right now: getBlockByNumber and exchangeTransitionConfiguration.

jakubgs commented 1 year ago

The linux-01.ih-eu-mda1.nimbus.mainnet host is nearly fully synced:

admin@linux-01.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9300 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"4350302","sync_distance":"1956324","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6306626","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6306626","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6306626","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6306626","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6306626","sync_distance":"0","is_syncing":false,"is_optimistic":false}}

Aside from stable-01 which is syncing from scratch as intended.

One thing that worries me though is that the average load is slightly above 10.0, and we have 10 cores:

admin@linux-01.ih-eu-mda1.nimbus.mainnet:~ % uptime
 10:06:03 up 7 days, 21:05,  1 user,  load average: 10.15, 10.99, 10.82

image

A noticeable portion of CPU time is being used by journald and rsyslog due to the high volume of logs.
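
A quick way to confirm how much of the load is really the logging pipeline (a generic sketch, assuming sysstat is available on the host):

```bash
# Sample per-process CPU usage of the logging daemons over 5 seconds,
# then show the top service cgroups by CPU.
pidstat -u 5 1 -C 'journal|rsyslog'
sudo systemd-cgtop -b -n 3 --order=cpu
```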

I wonder if this high CPU saturation is an issue, or if it's fine for now. We do have an option to add a 2nd CPU if we want.

What do you think @Menduist @zah? If you think this host looks fine as is, I will start releasing some of our Hetzner ones.

jakubgs commented 1 year ago

Here are the Grafana dashboards for the linux-01.ih-eu-mda1.nimbus.mainnet host:

https://grafana.infra.status.im/d/QCTZ8-Vmk/single-host-dashboard?orgId=1&refresh=1m&var-host=linux-01.ih-eu-mda1.nimbus.mainnet https://metrics.status.im/d/pgeNfj2Wz23/nimbus-fleet-testnets?orgId=1&refresh=5m&var-instance=linux-01.ih-eu-mda1.nimbus.mainnet&var-container=beacon-node-mainnet-stable-02&from=now-24h&to=now

jakubgs commented 1 year ago

Another option would be dropping one of the nodes to lessen CPU pressure. We could just get rid of stable-02.

jakubgs commented 1 year ago

Most nodes on linux-02 and linux-04 are still syncing backwards:

admin@linux-04.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9300 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"4815694","sync_distance":"1500347","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6307691","sync_distance":"8350","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6307918","sync_distance":"8123","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6308232","sync_distance":"7809","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6315723","sync_distance":"318","is_syncing":true,"is_optimistic":true,"el_offline":false}}
{"data":{"head_slot":"6312724","sync_distance":"3317","is_syncing":true,"is_optimistic":true,"el_offline":false}}

No help in sight.

jakubgs commented 1 year ago

I'm going to decommission 6 out of 7 Hetzner hosts since they are mostly unusable now:

Changes to Outputs:
  ~ hosts = {
      ~ "metal-01.he-eu-hel1.nimbus.mainnet"                = "95.217.87.121" -> "65.109.80.106"
      - "metal-02.he-eu-hel1.nimbus.mainnet"                = "135.181.0.33"
      - "metal-03.he-eu-hel1.nimbus.mainnet"                = "135.181.60.170"
      - "metal-04.he-eu-hel1.nimbus.mainnet"                = "65.21.193.229"
      - "metal-05.he-eu-hel1.nimbus.mainnet"                = "135.181.60.177"
      - "metal-06.he-eu-hel1.nimbus.mainnet"                = "135.181.56.50"
      - "metal-07.he-eu-hel1.nimbus.mainnet"                = "65.109.80.106"
        # (26 unchanged attributes hidden)
    }

I will reuse two hosts for Nimbus GitHub CI runners; the rest will be cancelled.

jakubgs commented 1 year ago

I'm going to keep the old metal-03 and metal-05 since their SSDs have very low usage:

metal-01.he-eu-hel1.nimbus.mainnet - 95.217.87.121

Power On Hours:                     29,549
Power On Hours:                     29,549

metal-02.he-eu-hel1.nimbus.mainnet - 135.181.0.33

Power On Hours:                     4,338
Power On Hours:                     5,562

metal-03.he-eu-hel1.nimbus.mainnet - 135.181.60.170

Power On Hours:                     3,716
Power On Hours:                     3,716

metal-04.he-eu-hel1.nimbus.mainnet - 65.21.193.229

Power On Hours:                     8,089
Power On Hours:                     9,465

metal-05.he-eu-hel1.nimbus.mainnet - 135.181.60.177

Power On Hours:                     712
Power On Hours:                     724

image

metal-06.he-eu-hel1.nimbus.mainnet - 135.181.56.50

Power On Hours:                     3,605
Power On Hours:                     3,605
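
These figures look like smartctl output; on a host with smartmontools installed they can be pulled with something like this (the device names are illustrative):

```bash
# Print the power-on-hours counter of both NVMe drives on a host.
for dev in /dev/nvme0 /dev/nvme1; do
  sudo smartctl -a "$dev" | grep -i 'power on hours'
done
```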
jakubgs commented 1 year ago

I inquired about our servers with Innova and this is what they said:

We still waiting for required CPU for servers to be able to activate more of them, as you want exact CPU model, which is depending on the supply company.

As for time, when new servers will be activated, we will set next due date a year from activation date, so you didn't loose any day, due to delayed activation.

Currently, we are out of stock of Intel E5-2690 v2, so we will look if we can get 20 Intel E5-2667 v3 or v4.

So I said we could get away with using E5-2690 v2 for Mainnet and then E5-2667 v4 for Prater:

I think we could be fine with E5-2667 v4 or v3(but consistently one or the other) for the rest of the hosts as long as we get just one more E5-2690 v2 host.

The reason for this is that we want to have 7 Mainnet hosts, and currently we have 5, and 1 is used for Prater. I could repurpose the Prater host for Mainnet, which would leave us at 6 out of 7. If we could get one more with E5-2690 v2 that would give us a full Mainnet fleet with the same CPUs.

Then the E5-2667 v4 could go to our Prater testnet hosts, which at least should give us consistent performance across the whole network.

jakubgs commented 1 year ago

I've also adjusted the layout of nodes on the new Mainnet hosts to use the `--no-el` flag for all even-numbered nodes:

Because the latency graphs for the execution layer endpoint are not pretty:

image

Which suggests we are abusing it too much with 6 beacon nodes connected to the same Geth node.
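
As a rough illustration of that layout rule (purely illustrative, not the actual Ansible logic), the parity of the node's index decides whether it gets the flag:

```bash
# Illustrative only: even-numbered nodes (unstable-02, -04, ...) get --no-el
# and run without their own execution layer connection.
node="beacon-node-mainnet-unstable-02"
idx="${node##*-}"                  # -> "02"
extra_flags=""
if (( 10#$idx % 2 == 0 )); then    # force base 10 so "08"/"09" parse correctly
  extra_flags="--no-el"
fi
echo "${node} -> ${extra_flags:-<keeps its EL connection>}"
```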

jakubgs commented 1 year ago

It definitely makes a difference:

image

jakubgs commented 1 year ago

Well, the result of adding the `--no-el` flag to the 02 nodes is clear:

admin@linux-02.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9300 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"6372901","sync_distance":"19942","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6392843","sync_distance":"0","is_syncing":false,"is_optimistic":true}}
{"data":{"head_slot":"6365190","sync_distance":"27653","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6392843","sync_distance":"0","is_syncing":false,"is_optimistic":true}}
{"data":{"head_slot":"6387811","sync_distance":"5032","is_syncing":true,"is_optimistic":true,"el_offline":false}}
{"data":{"head_slot":"6392843","sync_distance":"0","is_syncing":false,"is_optimistic":true,"el_offline":true}}

The ones that have it sync fine; the rest fail to sync.

jakubgs commented 1 year ago

The latency on EL node responses is very good:

image

So I don't think that's the issue.

jakubgs commented 1 year ago

I've received a 7th host from InnovaHosting with the Xeon E5-2690 CPU. With the Prater host we will have a full Mainnet fleet.

I'm going to re-purpose linux-01.ih-eu-mda1.nimbus.prater as linux-07.ih-eu-mda1.nimbus.mainnet so the whole Mainnet fleet uses the E5-2690 for consistent performance.

Changes:

Result:

 > a ih-eu-mda1 -o -a 'lscpu | grep "Model name"' | sort
linux-01.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 | (stdout) Model name: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
linux-02.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 | (stdout) Model name: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
linux-03.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 | (stdout) Model name: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
linux-04.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 | (stdout) Model name: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
linux-05.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 | (stdout) Model name: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
linux-06.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 | (stdout) Model name: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
linux-07.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 | (stdout) Model name: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
jakubgs commented 1 year ago

The nodes are syncing, but I'll rsync ERA files to them manually:

admin@linux-06.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9300 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"315215","sync_distance":"6087220","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"313739","sync_distance":"6088696","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"265347","sync_distance":"6137088","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"226308","sync_distance":"6176127","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"218076","sync_distance":"6184359","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"175205","sync_distance":"6227230","is_syncing":true,"is_optimistic":false,"el_offline":true}}
admin@linux-07.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9300 9304); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"326049","sync_distance":"6076388","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"128110","sync_distance":"6274327","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"103949","sync_distance":"6298488","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"71498","sync_distance":"6330939","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"41378","sync_distance":"6361059","is_syncing":true,"is_optimistic":false}}
jakubgs commented 1 year ago

I have decommissioned the last Mainnet Hetzner host:

And cancelled the host subscription:

image

jakubgs commented 1 year ago

Looks like the remaining servers should show up next week:

image

jakubgs commented 1 year ago

We got the first 3 hosts, which I will use for Sepolia and Prater:

image

jakubgs commented 1 year ago

Bootstrapped and configured the hosts:

The nodes are syncing:

admin@linux-01.ih-eu-mda1.nimbus.sepolia:~ % for port in $(seq 9311 9314); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"201883","sync_distance":"2295315","is_syncing":true,"is_optimistic":true,"el_offline":false}}
{"data":{"head_slot":"159219","sync_distance":"2337979","is_syncing":true,"is_optimistic":true,"el_offline":false}}
{"data":{"head_slot":"84759","sync_distance":"2412439","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"3540","sync_distance":"2493658","is_syncing":true,"is_optimistic":false,"el_offline":fa
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"20411","sync_distance":"5745667","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"14122","sync_distance":"5751956","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"8505","sync_distance":"5757573","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"2303","sync_distance":"5763775","is_syncing":true,"is_optimistic":false,"el_offline":fals
admin@linux-02.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"33307","sync_distance":"5732917","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"9065","sync_distance":"5757159","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"3551","sync_distance":"5762673","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"11717","sync_distance":"5754507","is_syncing":true,"is_optimistic":false,"el_offline":false}}
jakubgs commented 1 year ago

It appears the Sepolia host has synced fully without issues:

admin@linux-01.ih-eu-mda1.nimbus.sepolia:~ % for dir in /docker/geth-*; do $dir/rpc.sh eth_syncing | jq -c; done
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
admin@linux-01.ih-eu-mda1.nimbus.sepolia:~ % for port in $(seq 9311 9314); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"2518357","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"2518357","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"2518357","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"2518357","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":false}}
admin@linux-01.ih-eu-mda1.nimbus.sepolia:~ % for dir in /docker/geth-*; do $dir/rpc.sh eth_syncing | jq -c; done         
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
admin@linux-01.ih-eu-mda1.nimbus.sepolia:~ % sudo du -hsc /data/* /docker/*                                              
17G /data/beacon-node-sepolia-unstable-01
17G /data/beacon-node-sepolia-unstable-02
17G /data/beacon-node-sepolia-unstable-03
17G /data/beacon-node-sepolia-unstable-04
17G /data/beacon-node-sepolia-unstable-trial-01
16K /data/lost+found
8.1G    /data/nimbus-eth1-sepolia-master-trial
2.6G    /data/validator-client-sepolia-unstable-01
2.6G    /data/validator-client-sepolia-unstable-02
2.6G    /data/validator-client-sepolia-unstable-03
2.6G    /data/validator-client-sepolia-unstable-04
29G /docker/geth-sepolia-01
29G /docker/geth-sepolia-02
29G /docker/geth-sepolia-03
29G /docker/geth-sepolia-04
4.2M    /docker/log
16K /docker/lost+found
213G    total

Time to deploy validators to the new host.

jakubgs commented 1 year ago

I have removed the validators from the old Sepolia host and deployed them to the new host:

The validators missed about 6 epochs each:

image

https://sepolia.beaconcha.in/validator/661

jakubgs commented 1 year ago

Prater nodes are still syncing:

admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"2025440","sync_distance":"3763942","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"2005777","sync_distance":"3783605","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"2019396","sync_distance":"3769986","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"1574778","sync_distance":"4214604","is_syncing":true,"is_optimistic":false,"el_offline":false}}
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % sudo du -hsc /data/*
20G /data/beacon-node-prater-libp2p
22G /data/beacon-node-prater-stable
22G /data/beacon-node-prater-testing
37G /data/beacon-node-prater-unstable
4.0K    /data/era
16K /data/lost+found
98G total
admin@linux-02.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"2056079","sync_distance":"3733306","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"2015206","sync_distance":"3774179","is_syncing":true,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"2046075","sync_distance":"3743310","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"1600162","sync_distance":"4189223","is_syncing":true,"is_optimistic":false,"el_offline":false}}
admin@linux-02.ih-eu-mda1.nimbus.prater:~ % sudo du -hsc /data/*
20G /data/beacon-node-prater-libp2p
22G /data/beacon-node-prater-stable
36G /data/beacon-node-prater-testing
22G /data/beacon-node-prater-unstable
4.0K    /data/era
16K /data/lost+found
98G total
jakubgs commented 1 year ago

Still syncing:

image

But the Geth nodes are less than 1 million blocks away from being synced:

admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for dir in /docker/geth-*; do $dir/rpc.sh eth_syncing | jq -c '.result | { currentBlock, highestBlock }'; done
{"currentBlock":"0x850633","highestBlock":"0x850674"}
{"currentBlock":"0x84ecbf","highestBlock":"0x84ed00"}
{"currentBlock":"0x85324e","highestBlock":"0x85378a"}
{"currentBlock":"0x7985d9","highestBlock":"0x79861a"}
admin@linux-02.ih-eu-mda1.nimbus.prater:~ % for dir in /docker/geth-*; do $dir/rpc.sh eth_syncing | jq -c '.result | { currentBlock, highestBlock }'; done
{"currentBlock":"0x858bb7","highestBlock":"0x858bf8"}
{"currentBlock":"0x857f80","highestBlock":"0x857fc1"}
{"currentBlock":"0x858dd7","highestBlock":"0x8595ce"}
{"currentBlock":"0x7a0b01","highestBlock":"0x7a0b42"}

So we should be able to switch the public API endpoints to these hosts within a day or two.

jakubgs commented 1 year ago

Looks like we are off by about 100k blocks:

image

So this should finish soon, and then the trienode syncing will start. It might possibly be done tomorrow.

jakubgs commented 1 year ago

Looks like most nodes finished syncing, except 3:

admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"5853044","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"5853044","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"5853044","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"5106185","sync_distance":"746859","is_syncing":true,"is_optimistic":true,"el_offline":false}}
admin@linux-02.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"5853072","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"5853072","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"5853072","sync_distance":"0","is_syncing":false,"is_optimistic":true,"el_offline":false}}
{"data":{"head_slot":"5142760","sync_distance":"710312","is_syncing":true,"is_optimistic":true,"el_offline":false}}
jakubgs commented 1 year ago

It appears that sometimes Geth nodes get to the trienode sync stage and then just stop:

image

No progress whatsoever. I guess it's time for a restart.

jakubgs commented 1 year ago

In hindsight I should have upgraded to Geth 1.12.0 so we could start syncing with the new Pebble database instead of LevelDB: https://github.com/ethereum/go-ethereum/releases/tag/v1.12.0
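
For a fresh datadir the database backend can be picked at first start; a sketch of what that would have looked like (the datadir is a placeholder, and the flag only has an effect when the datadir is empty):

```bash
# Hypothetical fresh Goerli sync on Geth >= 1.12.0 using the Pebble backend.
geth --goerli \
  --datadir /docker/geth-goerli-01/node/data \
  --db.engine pebble
```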

jakubgs commented 1 year ago

It appears only the 04 Geth nodes on the new Prater hosts are not fully synced, but they are almost there:

image

The rest are done:

admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"5866576","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"5866576","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"5866576","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"5721632","sync_distance":"144944","is_syncing":true,"is_optimistic":true,"el_offline":fals
admin@linux-02.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"5866576","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"5866576","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":true}}
{"data":{"head_slot":"5866576","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"5751773","sync_distance":"114803","is_syncing":true,"is_optimistic":true,"el_offline":false}}

The remaining nodes are only libp2p, so maybe it's time to switch API and validators to the new hosts.

jakubgs commented 1 year ago

I have moved the validators from the old linux-02 to the new linux-02 on the InnovaHosting server:

Effect:

admin@linux-02.ih-eu-mda1.nimbus.prater:~ % sudo find /data/beacon-node-prater-{stable,testing,unstable,libp2p}/data/secrets/ -type f
/data/beacon-node-prater-stable/data/secrets/0x94b906d2efe55dbf622d2790fb5ac11dead1b90414ee8728c3912f189f96ae29b1b784047418acaa052a46cedd6821e4
/data/beacon-node-prater-testing/data/secrets/0x94b98aba01a83401cad0c8929a3bba5ec78393a73cff54689c9114d719c61fd74c44b4750e6aaeb14c820e77feb8e419
/data/beacon-node-prater-unstable/data/secrets/0x94bcce71396877f3c16d9aa6dadcdef060e1a248ea063cc68892bfa6969ed5af3eb8bc6b0d66476ef37e0468345da8c0
/data/beacon-node-prater-libp2p/data/secrets/0x94bdf6db0d7d429da5ec1eb198543bd87281f7e3eb21600ec0fe2140cf3516cbf28d0bf03bfb35ed19a2ac2dd7988cb7

The libp2p node should be up shortly.

And the API endpoints are up as well:

image

jakubgs commented 1 year ago

We have received another 4 hosts from Innova:

image

Will bootstrap them today.

jakubgs commented 1 year ago

I have bootstrapped the 4 new hosts:

And the nodes are syncing:

admin@linux-03.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"218401","sync_distance":"5698081","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"211451","sync_distance":"5705031","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"203154","sync_distance":"5713328","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"144955","sync_distance":"5771527","is_syncing":true,"is_optimistic":false,"el_offline":false}}
admin@linux-04.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"191968","sync_distance":"5724515","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"212027","sync_distance":"5704456","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"185232","sync_distance":"5731251","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"144573","sync_distance":"5771910","is_syncing":true,"is_optimistic":false,"el_offline":false}}
admin@linux-05.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"178278","sync_distance":"5738206","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"172308","sync_distance":"5744176","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"133745","sync_distance":"5782739","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"121869","sync_distance":"5794615","is_syncing":true,"is_optimistic":false,"el_offline":false}}
admin@linux-06.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"230025","sync_distance":"5686814","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"166794","sync_distance":"5750045","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"20428","sync_distance":"5896411","is_syncing":true,"is_optimistic":false,"el_offline":false}}
{"data":{"head_slot":"12032","sync_distance":"5904807","is_syncing":true,"is_optimistic":false,"el_offline":false}}
jakubgs commented 1 year ago

We have received another 6 servers from InnovaHosting.

After reviewing our storage needs on the Prater hosts I've decided to ask them to move 1 NVMe from each of the new servers to our nimbus.prater fleet hosts, to fix the issue of insufficient storage for the Geth nodes:

admin@linux-01.ih-eu-mda1.nimbus.prater:~ % df -h /docker    
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc        1.5T  1.1T  315G  78% /docker
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % sudo du -hsc /docker/*
268G    /docker/geth-goerli-01
269G    /docker/geth-goerli-02
270G    /docker/geth-goerli-03
271G    /docker/geth-goerli-04
5.1M    /docker/log
16K /docker/lost+found
1.1T    total

I've created a ticket:

Hello,

Thanks for delivering another 6 servers.

After rethinking our setup and our storage needs I was wondering if it would be possible to remove just one 1.6 TB NVMe from each of the new server9724 to server9729 servers and add them to the following ones:

  • server9717 - 185.181.230.78 - Real name: linux-01.ih-eu-mda1.nimbus.prater
  • server9718 - 185.181.230.79 - Real name: linux-02.ih-eu-mda1.nimbus.prater
  • server9721 - 185.181.230.121 - Real name: linux-03.ih-eu-mda1.nimbus.prater
  • server9720 - 194.33.40.231 - Real name: linux-04.ih-eu-mda1.nimbus.prater
  • server9722 - 194.33.40.232 - Real name: linux-05.ih-eu-mda1.nimbus.prater
  • server9723 - 194.33.40.233 - Real name: linux-06.ih-eu-mda1.nimbus.prater

Maybe you could also update the names while you're at it.

https://client.innovahosting.net/viewticket.php?tid=527485&c=B8CyWeCt

jakubgs commented 1 year ago

They have moved the NVMes and I have received this response:

I have moved disks from server to server as you requested.

But to see them on server, you have to create logical drive on them via ssacli https://gist.github.com/mrpeardotnet/a9ce41da99936c0175600f484fa20d03

Also it is good to delete logical drive which are not existing anymore from servers from where we have extracted drives.

Be careful when you are using ssacli.

https://client.innovahosting.net/viewticket.php?tid=527485&c=B8CyWeCt

jakubgs commented 1 year ago

And indeed, after installing ssacli I can see the extra drive as Unassigned:

admin@linux-01.ih-eu-mda1.nimbus.prater:~ % sudo ssacli ctrl slot=0 pd all show

Smart Array P840ar in Slot 0 (Embedded)

   Array A

      physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SAS SSD, 400 GB, OK)

   Array B

      physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SAS SSD, 1.6 TB, OK)

   Array C

      physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SAS SSD, 1.6 TB, OK)

   Unassigned

      physicaldrive 2I:0:6 (port 2I:box 0:bay 6, SAS SSD, 1.6 TB, OK)
jakubgs commented 1 year ago

So according to the doc I have to run:

ssacli ctrl slot=0 create type=ld drives=2I:0:6 raid=0

Which should create a new RAID 0 logical drive on the controller in slot 0, using the disk in port 2I, box 0, bay 6.

jakubgs commented 1 year ago

If I look at the details of Array B we can see that it has Fault Tolerance: 0:

admin@linux-01.ih-eu-mda1.nimbus.prater:~ % sudo ssacli ctrl slot=0 ld 2 show detail

Smart Array P840ar in Slot 0 (Embedded)

   Array B

      Logical Drive: 2
         Size: 1.46 TB
         Fault Tolerance: 0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Unique Identifier: 600508B1001C8754D0E5B226B43349AF
         Disk Name: /dev/sdb 
         Mount Points: /mnt/sdb,/data 1.5 TB Partition Number 0
         Drive Type: Data
         LD Acceleration Method: Smart Path

Which I assume means it's RAID 0.

jakubgs commented 1 year ago

One thing I could do is combine the two 1.6 TB drives into one 3.2 TB RAID 0 volume for the Geth nodes. That way I could avoid having to split things up into two /docker1 and /docker2 folders. The disadvantage is that it roughly doubles the chance of the array failing, since losing either of the two NVMes takes out the whole volume. Although considering this is just a testnet, we could live with that.

jakubgs commented 1 year ago

Yeah, so I can delete the array and now I have two unassigned drives:

admin@linux-06.ih-eu-mda1.nimbus.prater:/ % sudo ssacli ctrl slot=0 ld 3 delete

Warning: Deleting an array can cause other array letters to become renamed.
         E.g. Deleting array A from arrays A,B,C will result in two remaining
         arrays A,B ... not B,C

Warning: Deleting the specified device(s) will result in data being lost.
         Continue? (y/n) y

admin@linux-06.ih-eu-mda1.nimbus.prater:/ % sudo ssacli ctrl slot=0 pd all show     

Smart Array P440ar in Slot 0 (Embedded)

   Array A

      physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS SSD, 400 GB, OK)

   Array B

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS SSD, 1.6 TB, OK)

   Unassigned

      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS SSD, 1.6 TB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS SSD, 1.6 TB, OK)

And I can get a RAID 0 out of them.

jakubgs commented 1 year ago

And now that looks sensible:

admin@linux-06.ih-eu-mda1.nimbus.prater:/ % sudo ssacli ctrl slot=0 create type=ld drives=1I:1:2,1I:1:3 raid=0

Warning: SSD Over Provisioning Optimization will be performed on the physical
         drives in this array. This process may take a long time and cause this
         application to appear unresponsive.
admin@linux-06.ih-eu-mda1.nimbus.prater:/ % sudo ssacli ctrl slot=0 pd all show                               

Smart Array P440ar in Slot 0 (Embedded)

   Array A

      physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS SSD, 400 GB, OK)

   Array B

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS SSD, 1.6 TB, OK)

   Array C

      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS SSD, 1.6 TB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS SSD, 1.6 TB, OK)

And it is now visible via fdisk:

admin@linux-06.ih-eu-mda1.nimbus.prater:/ % sudo fdisk -l /dev/sdc 
Disk /dev/sdc: 2.91 TiB, 3200575168512 bytes, 6251123376 sectors
Disk model: LOGICAL VOLUME  
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 262144 bytes / 524288 bytes

I cleaned up the partition table just in case:

admin@linux-06.ih-eu-mda1.nimbus.prater:/ % sudo wipefs -a /dev/sdc
/dev/sdc: 2 bytes were erased at offset 0x00000438 (ext4): 53 ef
jakubgs commented 1 year ago

And after running:

 > ap ansible/bootstrap.yml -t role::bootstrap:volumes -l linux-06.ih-eu-mda1.nimbus.prater  

I have it mounted properly:

admin@linux-06.ih-eu-mda1.nimbus.prater:/ % df -h /docker
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc        2.9T   28K  2.8T   1% /docker

Neat.
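
For reference, the volumes role is presumably doing the moral equivalent of the following by hand (a generic sketch, not the actual role):

```bash
# Create a filesystem on the new logical drive, mount it, and persist the mount.
sudo mkfs.ext4 -L docker /dev/sdc
sudo mkdir -p /docker
echo 'LABEL=docker /docker ext4 defaults,noatime 0 2' | sudo tee -a /etc/fstab
sudo mount /docker
```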

jakubgs commented 1 year ago

I've also bootstrapped 3 nimbus.geth hosts:

https://github.com/status-im/infra-nimbus/blob/0b7111e70894da25a0855fd3c29bae80d2b865d0/ansible/inventory/test#L4-L9