status-im / infra-nimbus

Infrastructure for Nimbus cluster
https://nimbus.team

Deploy Beacon Node on MacOS #60

Closed: jakubgs closed this 2 years ago

jakubgs commented 3 years ago

We need a MacOS host for Prater testnet nodes. The minimum hardware requirements:

It will run 3 instances of infra-role-beacon-node connected to the Prater testnet. Each instance will run a build from a different branch (unstable, testing, stable). The nodes will take over the validators of the current Prater testnet nodes with the 04 index (e.g. stable-04, testing-04, etc.).

It should also build the newest version of the respective branch daily.

Full Details: https://github.com/status-im/infra-nimbus/issues/58
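
The daily rebuild could be as simple as a scheduled job per instance; a sketch under assumed paths and times (the actual role may use launchd or a different script name):

# Hypothetical crontab entry for the stable instance; path, time and script name are placeholders.
# m h dom mon dow  command
0 4 * * * /Users/nimbus/beacon-node-prater-stable/build.sh >> /Users/nimbus/beacon-node-prater-stable/logs/build.log 2>&1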

jakubgs commented 3 years ago

This is a good lecture on MacOS packet filter(pf) firewall: https://www.youtube.com/watch?v=SOWQCAA8lZA

I found the slides: https://macadmins.psu.edu/files/2017/07/psumac2017-148-Packet-Filtering-Mac-OSX-Under-the-Hood-with-Apple-PF-2c3abhl.pdf

And his firewall setup: https://github.com/jhimes/PF-setup

jakubgs commented 3 years ago

I re-installed the host and re-deployed the nodes, and so far the original issue has not reappeared, but I'm seeing something new.

The metrics endpoint for stable node stopped responding during the night:

admin@macos-01.ms-eu-dublin.nimbus.prater:~ % curl -sv localhost:9200/health
*   Trying ::1...
* TCP_NODELAY set
* Connection failed
* connect to ::1 port 9200 failed: Connection refused
*   Trying 127.0.0.1...
* TCP_NODELAY set

But when I call the REST API endpoint for the same node it works fine:

admin@macos-01.ms-eu-dublin.nimbus.prater:~ % curl -sS localhost:9300/eth/v1/node/version
{"data":{"version":"Nimbus/v1.5.0-f52efc-stateofus"}}

I checked the process ports and they are all open, including 9200 for metrics:

admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo lsof -iTCP -sTCP:LISTEN -n -P | grep 49291
nimbus_be 49291 nimbus    3u  IPv4 0x12e16c957846c00b      0t0  TCP *:9200 (LISTEN)
nimbus_be 49291 nimbus   11u  IPv4 0x12e16c9578466bdb      0t0  TCP 127.0.0.1:9900 (LISTEN)
nimbus_be 49291 nimbus   12u  IPv4 0x12e16c9578464d93      0t0  TCP 127.0.0.1:9300 (LISTEN)
nimbus_be 49291 nimbus   17u  IPv4 0x12e16c95784597ab      0t0  TCP *:9000 (LISTEN)

Which makes no sense... why is MacOS such a trash OS?
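
One thing worth checking when a port shows as LISTEN but refuses connections is whether the process has exhausted its file descriptors, since accept() then fails with EMFILE. These are my own suggested commands (PID taken from the lsof output above), not output from the host:

# Hypothetical diagnostic: how many descriptors the node holds, and of what kinds.
sudo lsof -p 49291 | wc -l
sudo lsof -p 49291 | awk 'NR>1 {print $5}' | sort | uniq -c | sort -rn | head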

jakubgs commented 3 years ago

And I don't see any informative or related errors at the time the alert first showed up: 2021-10-08T02:23:23Z

{"lvl":"DBG","ts":"2021-10-08 02:20:01.075+01:00","msg":"Error processing request","topics":"networking","tid":322828,"file":"eth2_network.nim:539","peer":"16U*5kxHjx","responseCode":1,"errMsg":"Failed to decompress snappy payload"}
{"lvl":"DBG","ts":"2021-10-08 02:20:01.075+01:00","msg":"Error processing an incoming request","topics":"sync","tid":322828,"file":"eth2_network.nim:787","err":"Stream Closed!","msgName":"pingObj"}
{"lvl":"DBG","ts":"2021-10-08 02:20:11.003+01:00","msg":"Error processing request","topics":"networking","tid":322828,"file":"eth2_network.nim:539","peer":"16U*ZiFi3u","responseCode":1,"errMsg":"Incomplete request"}
{"lvl":"DBG","ts":"2021-10-08 02:20:11.003+01:00","msg":"Error processing an incoming request","topics":"sync","tid":322828,"file":"eth2_network.nim:787","err":"Stream Closed!","msgName":"statusObj"}
{"lvl":"DBG","ts":"2021-10-08 02:21:04.862+01:00","msg":"A future has failed, enable trace logging for details","topics":"libp2p muxer","tid":322828,"file":"errors.nim:31","error":"TooManyConnectionsError"}
{"lvl":"DBG","ts":"2021-10-08 02:22:38.685+01:00","msg":"A future has failed, enable trace logging for details","topics":"libp2p connmanager","tid":322828,"file":"errors.nim:31","error":"LPStreamClosedError"}
{"lvl":"DBG","ts":"2021-10-08 02:22:42.899+01:00","msg":"Error processing request","topics":"networking","tid":322828,"file":"eth2_network.nim:539","peer":"16U*2B5mj7","responseCode":1,"errMsg":"Incomplete request"}
{"lvl":"DBG","ts":"2021-10-08 02:22:42.899+01:00","msg":"Error processing an incoming request","topics":"sync","tid":322828,"file":"eth2_network.nim:787","err":"Stream Closed!","msgName":"pingObj"}
{"lvl":"DBG","ts":"2021-10-08 02:23:14.514+01:00","msg":"Error processing request","topics":"networking","tid":322828,"file":"eth2_network.nim:539","peer":"16U*2B5mj7","responseCode":1,"errMsg":"Failed to decompress snappy payload"}
{"lvl":"DBG","ts":"2021-10-08 02:23:14.514+01:00","msg":"Error processing an incoming request","topics":"sync","tid":322828,"file":"eth2_network.nim:787","err":"Stream Closed!","msgName":"pingObj"}
{"lvl":"DBG","ts":"2021-10-08 02:23:42.471+01:00","msg":"Error processing request","topics":"networking","tid":322828,"file":"eth2_network.nim:539","peer":"16U*2B5mj7","responseCode":1,"errMsg":"Incomplete request"}
{"lvl":"DBG","ts":"2021-10-08 02:23:42.471+01:00","msg":"Error processing an incoming request","topics":"sync","tid":322828,"file":"eth2_network.nim:787","err":"Stream Closed!","msgName":"statusObj"}
{"lvl":"DBG","ts":"2021-10-08 02:23:44.652+01:00","msg":"A future has failed, enable trace logging for details","topics":"libp2p muxer","tid":322828,"file":"errors.nim:31","error":"TooManyConnectionsError"}
{"lvl":"DBG","ts":"2021-10-08 02:24:37.651+01:00","msg":"Error processing request","topics":"networking","tid":322828,"file":"eth2_network.nim:539","peer":"16U*QfQpvZ","responseCode":1,"errMsg":"Failed to decompress snappy payload"}
{"lvl":"DBG","ts":"2021-10-08 02:24:37.651+01:00","msg":"Error processing an incoming request","topics":"sync","tid":322828,"file":"eth2_network.nim:787","err":"Stream Closed!","msgName":"pingObj"}
{"lvl":"DBG","ts":"2021-10-08 02:24:58.909+01:00","msg":"Error processing request","topics":"networking","tid":322828,"file":"eth2_network.nim:539","peer":"16U*ZiFi3u","responseCode":1,"errMsg":"Incomplete request"}
{"lvl":"DBG","ts":"2021-10-08 02:24:58.909+01:00","msg":"Error processing an incoming request","topics":"sync","tid":322828,"file":"eth2_network.nim:787","err":"Stream Closed!","msgName":"pingObj"}
{"lvl":"DBG","ts":"2021-10-08 02:25:16.223+01:00","msg":"Error processing request","topics":"networking","tid":322828,"file":"eth2_network.nim:539","peer":"16U*2YTb8s","responseCode":1,"errMsg":"Failed to decompress snappy payload"}
{"lvl":"DBG","ts":"2021-10-08 02:25:16.223+01:00","msg":"Error processing an incoming request","topics":"sync","tid":322828,"file":"eth2_network.nim:787","err":"Stream Closed!","msgName":"pingObj"}
{"lvl":"DBG","ts":"2021-10-08 02:25:53.319+01:00","msg":"A future has failed, enable trace logging for details","topics":"libp2p muxer","tid":322828,"file":"errors.nim:31","error":"TooManyConnectionsError"}

jakubgs commented 3 years ago

This is something worth noting:

...one of the drawbacks of modifying pf.conf directly is that macOS upgrades revert that file to its default contents (removing your custom rules). For example, in an upgrade from High Sierra (macOS 10.13.x) to Catalina (10.15.x), the following pf files were overwritten on my test Mac:

  • /etc/pf.conf
  • /etc/pf.anchors/com.apple

Custom anchors under /etc/pf.anchors/ were retained, but they were not especially useful since the references to them in pf.conf were overwritten!

https://blog.neilsabol.site/post/quickly-easily-adding-pf-packet-filter-firewall-rules-macos-osx/

jakubgs commented 3 years ago

Based on the information above I think it's better to create a separate /etc/firewall.conf file that will be loaded by a custom service.

This is a basic ruleset that I tested, and it works:

# Don't filter loopback interface.
set skip on lo0

# Allows SSH connections.
pass in quick proto tcp from any to any port 22

# Block all incoming traffic.
block in all

This allows for all local traffic and external SSH access while blocking anything else incoming.

Note that this ruleset also blocks incoming traffic on the WireGuard VPN.
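
The "custom service" could be a LaunchDaemon that loads the ruleset at boot. A minimal sketch, assuming a placeholder label and /Library/LaunchDaemons/net.firewall.pf.plist as the location (not necessarily what the final deployment uses):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <!-- Loads /etc/firewall.conf and enables pf on every boot. -->
  <key>Label</key>
  <string>net.firewall.pf</string>
  <key>ProgramArguments</key>
  <array>
    <string>/sbin/pfctl</string>
    <string>-E</string>
    <string>-f</string>
    <string>/etc/firewall.conf</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>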

jakubgs commented 2 years ago

I've finished the new firewall setup.

Now all the configuration is located in /etc/firewall.conf and related files. More can be read in the readme.
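
For loading and verifying the ruleset by hand, generic pfctl usage looks like this (not necessarily the exact commands baked into the service):

sudo pfctl -nf /etc/firewall.conf   # parse the ruleset only, don't load it
sudo pfctl -f /etc/firewall.conf    # load the rules
sudo pfctl -E                       # enable pf (increments the enable reference count)
sudo pfctl -sr                      # show the rules currently loaded
sudo pfctl -si                      # show pf status and statistics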

Result:

 > sudo nmap -Pn -p22,8301,9000-9002,9200-9202 macos-01.ms-eu-dublin.nimbus.prater.statusim.net
Nmap scan report for macos-01.ms-eu-dublin.nimbus.prater.statusim.net (207.254.102.130)
Host is up (0.048s latency).

PORT     STATE    SERVICE
22/tcp   open     ssh
8301/tcp open     amberon
9000/tcp open     cslistener
9001/tcp open     tor-orport
9002/tcp open     dynamid
9200/tcp filtered wap-wsp
9201/tcp filtered wap-wsp-wtp
9202/tcp filtered wap-wsp-s

Nmap done: 1 IP address (1 host up) scanned in 14.27 seconds

jakubgs commented 2 years ago

I'm beginning to think the issue is memory related, since the host has only 8GB of RAM, while the usual usage on the Hetzner hosts is around 9GB, not counting cache.


But we don't have memory usage metrics for MacOS yet, since Netdata hasn't been set up there.

jakubgs commented 2 years ago

I could try getting a bigger Mac mini, but the next cancellation date is 2021-10-22.


But the host costs $109, so it's not that bad: https://docs.macstadium.com/docs/how-do-i-end-or-cancel-my-subscriptio

jakubgs commented 2 years ago

I've deployed Netdata to the MacOS hosts, and it works:

admin@macos-01.ms-eu-dublin.nimbus.prater:~ % curl -s 'localhost:8001/api/v1/allmetrics?format=prometheus' | head
netdata_info{instance="macos-01.ms-eu-dublin.nimbus.prater",application="netdata",version="v1.29.3"} 1 1634051353246
netdata_host_tags_info{} 1 1634051353246
netdata_host_tags{} 1 1634051353246
netdata_disk_svctm_milliseconds_operation_average{chart="disk_svctm.disk0",family="disk0",dimension="svctm"} 0.1333720 1634051335000
netdata_disk_avgsz_KiB_operation_average{chart="disk_avgsz.disk0",family="disk0",dimension="reads"} 15.8251950 1634051335000
netdata_disk_avgsz_KiB_operation_average{chart="disk_avgsz.disk0",family="disk0",dimension="writes"} -39.6865200 1634051335000
netdata_disk_await_milliseconds_operation_average{chart="disk_await.disk0",family="disk0",dimension="reads"} 0.1755850 1634051335000
netdata_disk_await_milliseconds_operation_average{chart="disk_await.disk0",family="disk0",dimension="writes"} -0.1080440 1634051335000
netdata_job_size_KB_average{chart="cups.job_size",family="overview",dimension="pending"} 0.0000000 1634051335000
netdata_job_size_KB_average{chart="cups.job_size",family="overview",dimension="held"} 0.0000000 1634051335000

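Since the endpoint exports Prometheus-format metrics, scraping it only needs a job like the sketch below (the job name is made up and the target is illustrative, not the actual infra configuration; the port and path are taken from the curl above):

# Hypothetical Prometheus scrape job for the Netdata allmetrics endpoint.
scrape_configs:
  - job_name: 'netdata-macos'
    metrics_path: '/api/v1/allmetrics'
    params:
      format: ['prometheus']
    static_configs:
      - targets: ['macos-01.ms-eu-dublin.nimbus.prater:8001']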

jakubgs commented 2 years ago

Created an issue about MacOS metrics and API being flaky: https://github.com/status-im/nimbus-eth2/issues/2984

jakubgs commented 2 years ago

Apparently by default MacOS does not give remote access to non-admin users: https://www.vinnie.work/blog/2020-12-26-why-so-hard-osx-ssh-access/

So they either have to be in the admin group, or be explicitly given the right to SSH into the host:

dseditgroup -o edit -n /Local/Default -u vinnie -p -a joe -t user com.apple.access_ssh

jakubgs commented 2 years ago

I adjusted the Ansible role to grant SSH access to users: https://github.com/status-im/infra-role-bootstrap-macos/commit/52273416
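
The kind of task the role could use for this is sketched below (hypothetical, not the actual role contents; the macos_ssh_users variable is made up for illustration):

# Hypothetical Ansible task granting SSH access via the com.apple.access_ssh group.
- name: Grant SSH access to users
  ansible.builtin.command: >
    dseditgroup -o edit -n /Local/Default
    -a {{ item }} -t user com.apple.access_ssh
  loop: '{{ macos_ssh_users }}'
  become: true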

admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo dscacheutil -q group -a name com.apple.access_ssh
name: com.apple.access_ssh
password: *
gid: 399
users: auto admin jakub petty zahary dustin mamy stefan dryajov kim giovanni tanguy cheatfate 

But I also added Nimbus team members to admin groups to make debugging easier: https://github.com/status-im/infra-nimbus/commit/7aa5d2d4

jakubgs commented 2 years ago

Also, I've received a response to my ticket about inaccessible DNS servers: https://portal.macstadium.com/tickets/140395

Our DNS servers were:

admin@macos-01.ms-eu-dublin.nimbus.prater:~ % tail -n4 /etc/resolv.conf
nameserver 207.254.72.253
nameserver 207.254.72.254
nameserver 8.8.8.8
nameserver 8.8.4.4

But apparently:

Good afternoon. Could I propose that you change the DNS servers to the Dublin MacStadium DNS servers as upon further investigation it seems your Mini was incorrectly set with the Las Vegas MacStadium DNS servers. At your discretion, you may also want to configure your preferred Public DNS as a fall-back as a third DNS server.

MacStadium Dublin DNS1: 207.254.25.253
MacStadium Dublin DNS2: 207.254.25.254

Regards, Mark

Of course Apple has to re-invent everything, so I had to modify the DNS configuration using the networksetup command:

admin@macos-01.ms-eu-dublin.nimbus.prater:~ % networksetup -getdnsservers Ethernet
207.254.72.253
207.254.72.254
8.8.8.8
8.8.4.4
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % networksetup -setdnsservers Ethernet 207.254.25.253 207.254.25.254 8.8.8.8 8.8.4.4 1.1.1.1
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % networksetup -getdnsservers Ethernet
207.254.25.253
207.254.25.254
8.8.8.8
8.8.4.4
1.1.1.1

So now it's correct and resolves queries fast.

jakubgs commented 2 years ago

To help with debugging I got SIP disabled via MacStadium support ticket: https://portal.macstadium.com/tickets/140647

admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo csrutil status 

System Integrity Protection status: disabled. 

jakubgs commented 2 years ago

After investigation in https://github.com/status-im/nimbus-eth2/issues/2984, and thanks to suggestions from @stefantalpalaru, it appears that the issue was indeed the open file limit: when it is set to a value higher than 65536 the OS simply ignores it, which is dumb beyond belief. And what's even dumber is that getrlimit() keeps reporting the fake limit for some reason.
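
For reference, the effective limits can be inspected in a few places on macOS; these are generic commands, not the contents of the bootstrap commit:

sysctl kern.maxfiles kern.maxfilesperproc   # kernel-wide and per-process caps
launchctl limit maxfiles                    # soft/hard limits launchd passes to its jobs
ulimit -n                                   # soft limit visible in the current shell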

Fixed in bootstrap: https://github.com/status-im/infra-role-bootstrap-macos/commit/a0ab4fb8

Looks like we'll be able to move validators to our new MacOS host next week.

jakubgs commented 2 years ago

I also tested changing the limits in /etc/sysctl.conf, but as far as I could tell it had no effect:

kern.maxfiles=20000000
kern.maxfilesperproc=20000000
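
For completeness, the runtime equivalent would be the commands below, though given the behaviour described above it's not clear the kernel honours values this high either (generic sysctl usage, not what ended up in the role):

# Runtime-only change, not persistent across reboots.
sudo sysctl -w kern.maxfiles=20000000
sudo sysctl -w kern.maxfilesperproc=20000000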

jakubgs commented 2 years ago

As suggested by Zah I've lowered the max peers limit to 200 to avoid hitting the open files limit: https://github.com/status-im/infra-nimbus/commit/5f04e4b1
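
On the beacon node side this presumably maps to the --max-peers option; the assumed form of the change is shown below (illustrative only, the exact change is in the linked commit):

# Assumed flag change on the service command line (illustrative).
nimbus_beacon_node ... --max-peers=200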

jakubgs commented 2 years ago

I've migrated the validators from the AWS Prater 02 nodes to the MacOS host, and it appears to be working fine:


admin@macos-01.ms-eu-dublin.nimbus.prater:/Users/nimbus % grep 'Attestation sent' beacon-node-*/logs/service.log | wc -l
    7019

jakubgs commented 2 years ago

I'd like to get rid of all three of the 02 AWS nodes, but I can't, because one of them is being used as a Prater bootstrap node: https://github.com/status-im/infra-nimbus/blob/961756674c6d3e7f3b512f166a59733a1a177899/ansible/host_vars/stable-large-02.aws-eu-central-1a.nimbus.prater.yml#L1-L2

But my idea is to get rid of both testing nodes:

testing-large-01.aws-eu-central-1a.nimbus.prater
testing-large-02.aws-eu-central-1a.nimbus.prater

And then rename stable-large-02.aws-eu-central-1a.nimbus.prater to testing-large-01.aws-eu-central-1a.nimbus.prater.

This way we can keep it without changing the IP or node ID. I just have to make sure the port stays the same.

jakubgs commented 2 years ago

It is done: https://github.com/status-im/infra-nimbus/commit/b5684f01

Changes to Outputs:
  ~ hosts = {
      - stable-large-02.aws-eu-central-1a.nimbus.prater     = "3.65.99.236" -> null
      ~ testing-large-01.aws-eu-central-1a.nimbus.prater    = "18.195.151.76" -> "3.65.99.236"
        # (26 unchanged elements hidden)
    }

Before:

 > dig +short stable-large-01.aws-eu-central-1a.nimbus.prater.statusim.net
3.120.202.252
 > dig +short stable-large-02.aws-eu-central-1a.nimbus.prater.statusim.net
3.65.99.236
 > dig +short testing-large-01.aws-eu-central-1a.nimbus.prater.statusim.net
18.195.151.76

Now:

 > dig +short stable-large-01.aws-eu-central-1a.nimbus.prater.statusim.net
3.120.202.252
 > dig +short testing-large-01.aws-eu-central-1a.nimbus.prater.statusim.net
3.65.99.236

The node key and port were preserved.

I will monitor the macos-01 host over the weekend and if everything goes well this issue can be closed.

jakubgs commented 2 years ago

As far as I can see, all three nodes on the three main branches work fine.


I consider this task done.