The terraform module for hetzner-cloud is ready and the servers for the new datacenter in Hetzner Cloud are deployed and running. Next I'll work on the config for the dedicated server.
Next steps:
The host is named stable-metal-01 to show that it's a hardware host.
The beacon node is now running on stable-metal-01.he-eu-hel1.nimbus.mainnet.
Grafana dashboard: https://metrics.status.im/d/pgeNfj2Wz23/nimbus-fleet-testnets?orgId=1&var-instance=stable-metal-01.he-eu-hel1.nimbus.mainnet&from=now-15m&to=now
Considering the recent Hetzner announcement about "mining", I guess we shouldn't run mainnet nodes on that host, just testnets.
Also, is there any progress on running multiple nodes on that host?
According to Zahary the beacon node would not qualify as "mining".
Running multiple beacon nodes (without validation) on that host has been tested and works well.
The layout requested is:
I think what we can do is to run 3 nodes per machine then, where each node will be using a different build/branch. The server index will correspond to the existing indices that we have. In other words server-01 will run stable-01, testing-01 and unstable-01, server-02 will run stable-02, testing-02, unstable-02, and so on.
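For illustration only, three instances on one host could look roughly like this, each with its own data directory and port set so they don't clash (the paths, ports and directory names here are hypothetical, not the actual role configuration):

# Hypothetical sketch: one beacon node per branch on the same host,
# separated by data directory and ports (all values illustrative).
/data/beacon-node-mainnet-stable/build/nimbus_beacon_node \
  --network=mainnet --data-dir=/data/beacon-node-mainnet-stable/data \
  --tcp-port=9000 --udp-port=9000 --metrics --metrics-port=9200
/data/beacon-node-mainnet-testing/build/nimbus_beacon_node \
  --network=mainnet --data-dir=/data/beacon-node-mainnet-testing/data \
  --tcp-port=9001 --udp-port=9001 --metrics --metrics-port=9201
/data/beacon-node-mainnet-unstable/build/nimbus_beacon_node \
  --network=mainnet --data-dir=/data/beacon-node-mainnet-unstable/data \
  --tcp-port=9002 --udp-port=9002 --metrics --metrics-port=9202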
We've decided to split the beacon-node Ansible role into OS-specific roles; there's a new Linux role at https://github.com/status-im/infra-role-beacon-node-linux. It will periodically pull changes from a branch, build them, and then run the node with systemd (previously it was run via Docker).
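In essence, the periodic update the role sets up boils down to something like this (a simplified sketch; the checkout path, branch and unit name are made up for illustration):

# Simplified sketch of what the systemd timer triggers periodically:
# fetch the configured branch, rebuild, and restart the node's service.
cd /data/beacon-node-mainnet-stable/repo            # hypothetical checkout location
git fetch origin
git checkout --force origin/stable                  # branch is per-node configuration
make -j"$(nproc)" nimbus_beacon_node                # standard nimbus-eth2 build target
sudo systemctl restart beacon-node-mainnet-stable   # hypothetical service name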
I've applied the role on the Hetzner server and it's running the layout mentioned above (3 mainnet nodes: stable, unstable, testing). See https://github.com/status-im/infra-nimbus/pull/63
Since I'm going on vacation, here are a few things that need to be done (for @jakubgs in case you want to continue with this):
- Rename stable-metal-01.he-eu-hel1.nimbus.mainnet and remove the stable- part.
- The distribute-validators role assumes that the process is run in a Docker container and needs to be changed.
- There are leftovers from the old setup (/data and docker) that can be removed.
Thanks for the notes.
The comment about the timer is outdated. It used to be that if Ansible fetched the newest changes, the timer wouldn't pick them up to be built the next time it ran. But since I've refactored the build script, it should not be relevant anymore.
I've extracted the distribute-validators role into its own repo: https://github.com/status-im/infra-role-dist-validators
Using it from the beacon-node role allows it to be run for multiple nodes running on the same host in different folders.
Changes:
I also fixed the behavior of the role to run the timer instead of the script directly and did some fixes for the timer role:
And I deployed prater beacon nodes from 3 branches on the Hetzner node: https://github.com/status-im/infra-nimbus/commit/bdab0a2f
I'll let it sync and then move over the validators.
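Once the validators are moved over, each of the three prater nodes should end up with its own subset of keystores under its own data directory, roughly like this (the paths are hypothetical):

# Hypothetical layout after distributing validators across the three nodes:
# each instance keeps its own validators/ and secrets/ directories.
ls /data/beacon-node-prater-stable/data/validators
ls /data/beacon-node-prater-testing/data/validators
ls /data/beacon-node-prater-unstable/data/validators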
Found a bug in Consul service name that broke metrics scraping: https://github.com/status-im/infra-role-beacon-node-linux/commit/2ec11727
The metrics are there, but the dashboard is not suited to display metrics for multiple services on the same host:
I just noticed that the Hetzner host partition layout is kinda weird.
Here are the devices:
admin@metal-01.he-eu-hel1.nimbus.prater:/data % sudo lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme0n1       259:0    0   477G  0 disk
├─nvme0n1p1   259:1    0    32G  0 part
│ └─md0         9:0    0    32G  0 raid1 [SWAP]
├─nvme0n1p2   259:2    0   512M  0 part
│ └─md1         9:1    0   511M  0 raid1 /boot
└─nvme0n1p3   259:3    0 444.4G  0 part
  └─md2         9:2    0 444.3G  0 raid1 /
nvme1n1       259:4    0   477G  0 disk
├─nvme1n1p1   259:5    0    32G  0 part
│ └─md0         9:0    0    32G  0 raid1 [SWAP]
├─nvme1n1p2   259:6    0   512M  0 part
│ └─md1         9:1    0   511M  0 raid1 /boot
└─nvme1n1p3   259:7    0 444.4G  0 part
  └─md2         9:2    0 444.3G  0 raid1 /
And it appears there's 3 RAID volumes:
admin@metal-01.he-eu-hel1.nimbus.prater:/data % sudo fdisk -l /dev/md?
Disk /dev/md0: 31.99 GiB, 34325135360 bytes, 67041280 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk /dev/md1: 511 MiB, 535822336 bytes, 1046528 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk /dev/md2: 444.32 GiB, 477076193280 bytes, 931789440 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
But only two of them are mounted:
admin@metal-01.he-eu-hel1.nimbus.prater:/data % mount | grep /dev/md
/dev/md2 on / type ext4 (rw,relatime)
/dev/md1 on /boot type ext3 (rw,relatime)
And the 32 GB /dev/md0 is being used as SWAP:
admin@metal-01.he-eu-hel1.nimbus.prater:/data % cat /proc/swaps
Filename    Type        Size      Used  Priority
/dev/md0    partition   33520636  0     -2
I don't think we even need swap on a host that has 64 GB of RAM, definitely not that much.
Truth be told it would make more sense to install the OS on the 32 GB partition and keep md2 for Nimbus data.
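Moving the OS onto the 32 GB array would mean reinstalling, but just reclaiming md0 for data instead of swap could be done roughly like this (a destructive sketch for illustration, not something that was actually run):

# Sketch only: stop swapping to md0 and reuse the 32 GB array for data.
sudo swapoff /dev/md0              # disable the swap device
sudo sed -i '/md0/d' /etc/fstab    # drop its fstab entry so it stays off after reboot
sudo mkfs.ext4 /dev/md0            # reformat the array (destroys its contents)
sudo mount /dev/md0 /mnt           # mount it wherever it's needed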
I can't find any documentation on how Ubuntu was installed on this host. We probably should have some.
There is this, but my guess would be that it does its own partitioning without any input:
I'm hitting issues with a nightly tag being force-pushed to the nimbus-eth2 repo:
TASK [infra-role-beacon-node-linux : Clone repo branch] ******************************************************
fatal: [metal-01.he-eu-hel1.nimbus.prater]: FAILED! => {
    "changed": false,
    "cmd": [
        "/usr/bin/git",
        "fetch",
        "--tags",
        "origin"
    ]
}
MSG:
Failed to download remote objects and refs: From https://github.com/status-im/nimbus-eth2
+ 3392cffe...8f53b1a3 nim-libp2p-auto-bump-unstable -> origin/nim-libp2p-auto-bump-unstable (forced update)
! [rejected] nightly -> nightly (would clobber existing tag)
And this happens despite me using force: true here:
https://github.com/status-im/infra-role-beacon-node-linux/blob/376753c5/tasks/build.yml#L2-L9
Apparently this job deletes those every day: https://github.com/status-im/nimbus-eth2/blob/44f652f7/.github/workflows/nightly_build.yml#L275-L281
And git fetch --tags returns non-zero when it has to clobber a tag.
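Git refuses to move an existing tag on fetch since 2.20, and the git module's force: true seemingly only resets local modifications rather than forcing tag updates. The usual workarounds look like this (not necessarily what the role ended up doing):

# Either allow fetch to move existing tags...
git fetch --tags --force origin
# ...or drop the stale local tag and fetch again.
git tag -d nightly
git fetch --tags origin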
I've moved validators from 05 nodes to the Hetzner host in: https://github.com/status-im/infra-nimbus/commit/b5f75078
And got rid of the unnecessary 05 nodes in: https://github.com/status-im/infra-nimbus/commit/7d76f4b3
Looking good:
It looks like the attestations are being sent:
admin@metal-01.he-eu-hel1.nimbus.prater:~ % for i in {0..3}; do curl -s localhost:920$i/metrics | grep beacon_attestations_sent_total; done
beacon_attestations_sent_total 252166.0
beacon_attestations_sent_total 252276.0
beacon_attestations_sent_total 254786.0
admin@metal-01.he-eu-hel1.nimbus.prater:~ % for i in {0..3}; do curl -s localhost:920$i/metrics | grep beacon_attestations_sent_total; done
beacon_attestations_sent_total 252573.0
beacon_attestations_sent_total 252574.0
beacon_attestations_sent_total 255083.0
The metrics dashboard can't handle multiple containers, but that's not part of this task:
So I'm considering this done.
We bought a dedicated server from Hetzner (ax41-nvme) which we want to use for running validator nodes (see https://github.com/status-im/infra-nimbus/issues/45 for the research).
To do this we need to set up supporting infrastructure for the new provider first. This infra will run in Hetzner Cloud and the instances will be in the same data center as the dedicated server (Finland).
Specifically, we need the following servers in Hetzner Cloud (https://www.hetzner.com/cloud):
The tasks are: