status-im / infra-nimbus

Infrastructure for Nimbus cluster
https://nimbus.team

Deploy validator nodes on Hetzner server #52

Closed arthurk closed 3 years ago

arthurk commented 3 years ago

We bought a dedicated server from Hetzner (ax41-nvme) which we want to use for running validator nodes (see https://github.com/status-im/infra-nimbus/issues/45 for the research).

To do this we need to set up supporting infrastructure for the new provider first. This infra will run in Hetzner Cloud, and the instances will be in the same data center as the dedicated server (Finland).

Specifically, we need the following servers in Hetzner Cloud (https://www.hetzner.com/cloud):

The tasks are:

arthurk commented 3 years ago

The Terraform module for Hetzner Cloud is ready, and the servers for the new datacenter in Hetzner Cloud are deployed and running. Next I'll work on the config for the dedicated server.

arthurk commented 3 years ago

Next steps:

arthurk commented 3 years ago

The beacon node is now running on stable-metal-01.he-eu-hel1.nimbus.mainnet

Grafana dashboard: https://metrics.status.im/d/pgeNfj2Wz23/nimbus-fleet-testnets?orgId=1&var-instance=stable-metal-01.he-eu-hel1.nimbus.mainnet&from=now-15m&to=now

jakubgs commented 3 years ago

Considering the recent Hetzner announcement about "mining" I guess we shouldn't run mainnet nodes on that host, just testnets.

Also, is there any progress on running multiple nodes on that host?

arthurk commented 3 years ago

According to Zahary the beacon node would not qualify as "mining".

Running multiple beacon nodes (without validation) on that host has been tested and works well.

The layout requested is:

I think what we can do then is run 3 nodes per machine, where each node will be using a different build/branch. The server index will correspond to the existing indices that we have. In other words, server-01 will run stable-01, testing-01, and unstable-01; server-02 will run stable-02, testing-02, and unstable-02; and so on.
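The mapping can be sketched as a quick loop (names are illustrative, following the scheme described above):

```shell
# Illustrative sketch of the proposed layout: server N runs one beacon
# node per branch, reusing the existing node indices.
for i in 01 02; do
  for branch in stable testing unstable; do
    echo "server-${i} runs ${branch}-${i}"
  done
done
```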

arthurk commented 3 years ago

We've decided to split the beacon-node Ansible role into OS-specific roles; there's a new Linux role at https://github.com/status-im/infra-role-beacon-node-linux. It will periodically pull changes from a branch, build them, and then run the node with systemd (previously it was run via Docker).
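The periodic pull/build/run cycle maps naturally onto a systemd service/timer pair. A minimal sketch; the unit names, path, and schedule here are assumptions, not what the role actually installs:

```ini
# /etc/systemd/system/beacon-node-build.timer  (hypothetical unit)
[Unit]
Description=Periodically rebuild the beacon node from its branch

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/beacon-node-build.service  (hypothetical unit)
[Unit]
Description=Fetch, build, and restart the beacon node

[Service]
Type=oneshot
ExecStart=/usr/local/bin/build_beacon_node.sh
```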

I've applied the role on the Hetzner server and it's running the layout mentioned above (3 mainnet nodes: stable, unstable, testing). See https://github.com/status-im/infra-nimbus/pull/63

Since I'm going on vacation, here are a few things that need to be done (for @jakubgs, in case you want to continue with this):

jakubgs commented 3 years ago

Thanks for the notes.

The comment about the timer is outdated. It used to be that if Ansible fetched the newest changes, the timer wouldn't pick them up to be built the next time it ran. But since I've refactored the build script, that should no longer be relevant.

jakubgs commented 3 years ago

I've extracted the distribute-validators role into its own repo: https://github.com/status-im/infra-role-dist-validators

Using it from the beacon-node role allows it to be run for multiple nodes on the same host, each in a different folder.

Changes:

jakubgs commented 3 years ago

I also fixed the behavior of the role to run the timer instead of the script directly and did some fixes for the timer role:

jakubgs commented 3 years ago

And I deployed prater beacon nodes from 3 branches on the Hetzner node: https://github.com/status-im/infra-nimbus/commit/bdab0a2f

I'll let it sync and then move over the validators.

jakubgs commented 3 years ago

Found a bug in Consul service name that broke metrics scraping: https://github.com/status-im/infra-role-beacon-node-linux/commit/2ec11727

jakubgs commented 3 years ago

The metrics are there, but the dashboard is not suited to display metrics for multiple services on the same host:
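Until the dashboard is fixed, the per-service series can still be queried in Prometheus directly. A hypothetical query; the exact label names are assumptions, since the Consul service labels aren't shown here:

```
sum by (job) (
  rate(beacon_attestations_sent_total{instance=~"stable-metal-01.*"}[5m])
)
```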

(screenshot of the dashboard)

jakubgs commented 3 years ago

I just noticed that the Hetzner host partition layout is kinda weird.

Here are the devices:

admin@metal-01.he-eu-hel1.nimbus.prater:/data % sudo lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme0n1     259:0    0   477G  0 disk  
├─nvme0n1p1 259:1    0    32G  0 part  
│ └─md0       9:0    0    32G  0 raid1 [SWAP]
├─nvme0n1p2 259:2    0   512M  0 part  
│ └─md1       9:1    0   511M  0 raid1 /boot
└─nvme0n1p3 259:3    0 444.4G  0 part  
  └─md2       9:2    0 444.3G  0 raid1 /
nvme1n1     259:4    0   477G  0 disk  
├─nvme1n1p1 259:5    0    32G  0 part  
│ └─md0       9:0    0    32G  0 raid1 [SWAP]
├─nvme1n1p2 259:6    0   512M  0 part  
│ └─md1       9:1    0   511M  0 raid1 /boot
└─nvme1n1p3 259:7    0 444.4G  0 part  
  └─md2       9:2    0 444.3G  0 raid1 /

And it appears there's 3 RAID volumes:

admin@metal-01.he-eu-hel1.nimbus.prater:/data % sudo fdisk -l /dev/md?
Disk /dev/md0: 31.99 GiB, 34325135360 bytes, 67041280 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk /dev/md1: 511 MiB, 535822336 bytes, 1046528 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk /dev/md2: 444.32 GiB, 477076193280 bytes, 931789440 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

But only two of them are mounted:

admin@metal-01.he-eu-hel1.nimbus.prater:/data % mount | grep /dev/md
/dev/md2 on / type ext4 (rw,relatime)
/dev/md1 on /boot type ext3 (rw,relatime)

And the 32 GB /dev/md0 is being used as SWAP:

admin@metal-01.he-eu-hel1.nimbus.prater:/data % cat /proc/swaps
Filename                Type        Size        Used    Priority
/dev/md0                partition   33520636    0       -2

I don't think we even need swap on a host that has 64 GB of RAM, definitely not that much.

Truth be told it would make more sense to install the OS on the 32 GB partition and keep the md2 for Nimbus data.

jakubgs commented 3 years ago

I can't find any documentation on how Ubuntu was installed on this host. We probably should have some.

jakubgs commented 3 years ago

There is this, but my guess would be that it does its own partitioning without any input: (screenshot)

jakubgs commented 3 years ago

I'm hitting issues with a nightly tag being force pushed to nimbus-eth2 repo:

TASK [infra-role-beacon-node-linux : Clone repo branch] ******************************************************
fatal: [metal-01.he-eu-hel1.nimbus.prater]: FAILED! => {
    "changed": false,
    "cmd": [
        "/usr/bin/git",
        "fetch",
        "--tags",
        "origin"
    ]
}

MSG:

Failed to download remote objects and refs:  From https://github.com/status-im/nimbus-eth2
 + 3392cffe...8f53b1a3 nim-libp2p-auto-bump-unstable -> origin/nim-libp2p-auto-bump-unstable  (forced update)
 ! [rejected]          nightly    -> nightly  (would clobber existing tag)

And this happens despite me using force: true: https://github.com/status-im/infra-role-beacon-node-linux/blob/376753c5/tasks/build.yml#L2-L9

Apparently this job deletes those every day: https://github.com/status-im/nimbus-eth2/blob/44f652f7/.github/workflows/nightly_build.yml#L275-L281

And git fetch --tags returns non-zero when it has to clobber a tag.
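The failure is reproducible with plain git, independent of Ansible (whose `force: true` is about discarding local modifications, which wouldn't cover this). A sketch using temporary repos, assuming git >= 2.20, where fetch refuses to move an existing tag unless `--force` is given:

```shell
# Reproduce the clobbered-tag failure and its fix.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/origin"
git -C "$tmp/origin" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "build one"
git -C "$tmp/origin" tag nightly
git clone -q "$tmp/origin" "$tmp/clone"
# Upstream moves the nightly tag, like the nightly_build workflow does:
git -C "$tmp/origin" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "build two"
git -C "$tmp/origin" tag -f nightly
# A plain --tags fetch fails with "would clobber existing tag":
git -C "$tmp/clone" fetch --tags origin 2>/dev/null \
    && echo "unexpected: tag clobbered" || echo "clobber rejected"
# Adding --force lets the moved tag through:
git -C "$tmp/clone" fetch --tags --force origin 2>/dev/null
echo "forced update ok"
```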

jakubgs commented 3 years ago

I've moved validators from 05 nodes to the Hetzner host in: https://github.com/status-im/infra-nimbus/commit/b5f75078

And got rid of the unnecessary 05 nodes in: https://github.com/status-im/infra-nimbus/commit/7d76f4b3

Looking good: (screenshot)

jakubgs commented 3 years ago

It looks like the attestations are being sent:

admin@metal-01.he-eu-hel1.nimbus.prater:~ % for i in {0..3}; do curl -s localhost:920$i/metrics | grep beacon_attestations_sent_total; done
beacon_attestations_sent_total 252166.0
beacon_attestations_sent_total 252276.0
beacon_attestations_sent_total 254786.0
admin@metal-01.he-eu-hel1.nimbus.prater:~ % for i in {0..3}; do curl -s localhost:920$i/metrics | grep beacon_attestations_sent_total; done
beacon_attestations_sent_total 252573.0
beacon_attestations_sent_total 252574.0
beacon_attestations_sent_total 255083.0
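What matters is that the counters move between the two samples; an illustrative delta check over the numbers above:

```shell
# Illustrative: counter deltas between the two samples above.
# A positive delta means the node kept attesting during the interval.
for pair in "252166 252573" "252276 252574" "254786 255083"; do
  set -- $pair
  echo "delta: $(( $2 - $1 ))"
done
```

All three deltas come out positive (407, 298, and 297), so each node sent attestations between the samples.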

jakubgs commented 3 years ago

The metrics dashboard can't handle multiple containers, but that's not part of this task:

(screenshot of the dashboard)

So I'm considering this done.