status-im / infra-nimbus

Infrastructure for Nimbus cluster
https://nimbus.team

Deploy Beacon Node on MacOS #60

Closed jakubgs closed 2 years ago

jakubgs commented 3 years ago

We need a MacOS host for Prater testnet nodes. The minimum hardware requirements:

It will run 3 instances of infra-role-beacon-node connected to the Prater testnet. Each instance will run a build from a different branch (unstable, testing, stable). The nodes will take over the validators of the current Prater testnet nodes with the 04 index (e.g. stable-04, testing-04, etc.).

It should also build the newest version of the respective branch daily.

Full Details: https://github.com/status-im/infra-nimbus/issues/58

jakubgs commented 3 years ago

We currently use MacStadium for MacOS hosts for our CI: https://github.com/status-im/infra-ci/tree/master/modules/mac-stadium

They do provide the new Mini M1 for 109 USD per month, which isn't the worst price:

https://www.macstadium.com/configure?p=minig5invite

jakubgs commented 3 years ago

Some things to keep in mind:

The priority is to get it working. We can look into other things later if necessary.

arthurk commented 3 years ago

I started working on this. We've bought the Gen 5 Mac Mini with M1 CPU and it's ready to use.

Currently I'm merging the existing macos roles from infra-ci into a new infra-role-bootstrap-macos role. One thing to keep in mind is that the roles are made for macOS 11 and might need some tweaking for macOS 12.

After that the plan is to create an infra-role-beacon-node-macos role which installs and runs the beacon node. We can then use it in the infra-nimbus fleet. Overall it will be similar to the -windows roles, but for macOS.

jakubgs commented 3 years ago

@arthurk can you update this issue with your progress? I don't see infra-role-bootstrap-macos, nor do I see any commits in infra-role-beacon-node-macos. Are you committing and just not pushing? I recommend pushing smaller chunks, even if they are not functional initially.

arthurk commented 3 years ago

Not much progress, I was working on the bridges for vac and the new deployment for chat2bridge

arthurk commented 3 years ago

I've created the infra-role-bootstrap-macos repo (private) with the initial files. I'm testing it on the new host (the tasks were written for macOS 10 and the new host has macOS 11)

jakubgs commented 3 years ago

How is the progress? Are you stuck on anything?

arthurk commented 3 years ago

Good progress, I've been running the process for a while and it's working well. I'm now working on distributing the validators. I'll also look a bit into setting up a firewall; it's disabled by default on the hosts, but I'll need to check how to set it up via the CLI.

arthurk commented 3 years ago

The infra-role-beacon-node-macos ansible role is almost finished. I've updated the readme with important info, the node is running, and periodic builds and log rotation work fine. I'm currently working on integrating infra-role-dist-validators to distribute the secrets/validators for each node. It's almost identical to the Linux config, so there shouldn't be any major issues with it.

There is one problem with the sudoers file: I can't get it to allow a regular user to log in as the nimbus user. It doesn't seem high-priority, so I'll fix it later on.

As for our bootstrap-macos role, there are many things that can be optimized, since macOS is not meant for servers. There are processes running, like SafariBookmarkSync and ParentalControl, that we have no need for. The OS has a firewall, but it will need some work to figure out how to control it via CLI/Ansible. The role works fine right now, but there's definitely room for improvement.
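For reference, a rough sketch of how such a service could be located and disabled with launchctl; the grep pattern and job label are placeholders, and SIP may prevent disabling Apple's own daemons:

 > sudo launchctl list | grep -i safari          # find the job label
 > sudo launchctl disable system/<job-label>     # persists across reboots, may be denied by SIP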

arthurk commented 3 years ago

Quick update on where things are. I've included the config for prater in infra-nimbus at https://github.com/status-im/infra-nimbus/pull/68 and ran the full playbook for a new deployment (prater, unstable) with almost no problems. Build time is 3 minutes, which is nice compared to other machines.

There was one issue when ansible was running the first build as part of the playbook run and it failed with "CMake not installed. Aborting.". After re-running the ansible playbook it worked again. This might be an edge case for new deployments. I suspect something is wrong with the launchd config; if we can't figure out why it's happening we can just call the build.sh script in ansible instead of using the launchd module.

Otherwise some small issues I came across:

I've updated the readme and added more info at https://github.com/status-im/infra-role-beacon-node-macos

jakubgs commented 3 years ago

Running the playbook with tags doesn't do anything

Because tags have to be explicitly defined in the Playbook, and we only include beacon-node: https://github.com/status-im/infra-nimbus/blob/4f05e2f40dddd17ecbe4ea630b357b44f0bffd82/ansible/prater.yml#L44-L45 https://github.com/status-im/infra-nimbus/blob/4f05e2f40dddd17ecbe4ea630b357b44f0bffd82/ansible/prater.yml#L68-L69
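For illustration, the tag wiring in the playbook looks roughly like this (a sketch, not the exact file; the host group name is an assumption):

- name: Deploy Prater beacon nodes
  hosts: nimbus-prater-macos
  roles:
    # only this tag is defined, so running with any other tag matches nothing
    - { role: infra-role-beacon-node-macos, tags: beacon-node }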

arthurk commented 3 years ago

Still having a problem with the "CMake not installed. Aborting." error during deployment when running the build script.

Problem is that the PATH is wrong when ansible runs the task:

# ssh on server as nimbus user
echo $PATH
/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/homebrew/bin:/Library/Apple/usr/bin

# ansible "echo $PATH" command
/usr/bin:/bin:/usr/sbin:/sbin

cmake is in /opt/homebrew/bin. Will debug this more on Monday.

arthurk commented 3 years ago

I've checked the problem again and currently don't know what the issue could be.

Right after running the ansible playbook I'm starting the build in launchd:

sudo launchctl start status.beacon-node-prater-stable-build

which leads to the "CMake not installed. Aborting." error. Which makes sense: when printing the PATH it shows:

/usr/bin:/bin:/usr/sbin:/sbin

When I manually launch the build script as the nimbus user (./build.sh) it works as expected since the user has /opt/homebrew/bin in the path.

After that I start the same launchctl job as above and it suddenly works. When I print the path it still shows PATH=/usr/bin:/bin:/usr/sbin:/sbin, but it can find CMake and does the build successfully.

So now I'm trying to figure out why the launchd job works after the build.sh script has been run, but not before that.

arthurk commented 3 years ago

The first build in a freshly cloned repo will pull all submodules and build the Nim compiler, which fails since cmake is not in the PATH. But after the build was triggered manually by the nimbus user (with cmake in the PATH), all subsequent builds don't require cmake anymore, as the already-built Nim compiler will not be rebuilt.

So what I learned here is that building nimbus-eth2 doesn't actually require cmake. Only building the nim compiler requires it, which is only compiled once and not rebuilt as part of the whole beacon node build process.

Sourcing /etc/profile before starting the build sets the correct PATH and makes the build work correctly on the first run.
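A minimal sketch of what that looks like in the launchd build job, assuming it runs the script through a shell (paths are approximate):

  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>-c</string>
    <!-- source /etc/profile so /opt/homebrew/bin ends up in PATH for the first build -->
    <string>source /etc/profile && ./build.sh</string>
  </array>
  <key>WorkingDirectory</key>
  <string>/Users/nimbus/beacon-node-prater-stable</string>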

jakubgs commented 3 years ago

I saw this change: https://github.com/status-im/infra-nimbus/commit/92cfc833e

https://github.com/status-im/infra-nimbus/blob/92cfc833e46f0067527a7e26d5541ad63f2a7e9d/ansible/prater.yml#L104-L107

In which you were changing the validator layout and I just want to make sure you understand what you are doing, and that you cannot deploy the same validators to two or more different hosts or they will get slashed.

The validators you listed there are currently being used by stable-large-01.aws-eu-central-1a.nimbus.prater: https://github.com/status-im/infra-nimbus/blob/d7e0530d978acc5df16ac2e454f38792893efac4/ansible/group_vars/nimbus.prater.yml#L23

If you deploy that change along with the validators they will get slashed and will not work anymore.

jakubgs commented 3 years ago

Also, I could not find the IP of the new macos host in your branch, nor could I access it:

 > ssh admin@207.254.102.130
(admin@207.254.102.130) Password:

I took a look at the bootstrap role, and for some reason it doesn't include me: https://github.com/status-im/infra-role-bootstrap-macos/blob/24b32c3681efee40129542e0809281d71059ba09/defaults/main.yml#L15-L16

And the variables do not follow the bootstrap__ naming pattern.
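For reference, with the bootstrap__ prefix the defaults would look something like this (variable names are illustrative, not the role's actual ones):

bootstrap__admin_users: ['jakubgs', 'arthurk']
bootstrap__timezone: 'UTC'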

arthurk commented 3 years ago

I saw this change: 92cfc83

https://github.com/status-im/infra-nimbus/blob/92cfc833e46f0067527a7e26d5541ad63f2a7e9d/ansible/prater.yml#L104-L107

In which you were changing the validator layout and I just want to make sure you understand what you are doing, and that you cannot deploy the same validators to two or more different hosts or they will get slashed.

The validators you listed there are currently being used by stable-large-01.aws-eu-central-1a.nimbus.prater:

https://github.com/status-im/infra-nimbus/blob/d7e0530d978acc5df16ac2e454f38792893efac4/ansible/group_vars/nimbus.prater.yml#L23

If you deploy that change along with the validators they will get slashed and will not work anymore.

I was only using this for testing. The PR is still in "draft" mode and has "wip" in the title; no need to review it yet.

jakubgs commented 3 years ago

Add a user for me like in other bootstrap roles. I want to take a look.

arthurk commented 3 years ago

I've added a user for you in https://github.com/status-im/infra-role-bootstrap-macos/commit/6ffc9ed5e22e4a44eb25c10d46bc7a4620f80f8d

jakubgs commented 3 years ago

Now that https://github.com/status-im/infra-nimbus/pull/68 is merged we'll need two more things to get metrics:

Since you have only one week left, I'd like you to at least work on the Consul agent config in the bootstrap role, similar to how it's done in the Linux and Windows roles.

jakubgs commented 3 years ago

I've fixed the MacOS PR for Consul agent service: https://github.com/status-im/infra-role-bootstrap-macos/pull/1

And deployed the change adjusting the Consul data center to he-eu-hel1: https://github.com/status-im/infra-nimbus/commit/67c7eff4

admin@node-01.he-eu-hel1.consul.hq:~ % consul members | grep prater
macos-01.ms-eu-dublin.nimbus.prater  207.254.102.130:8301  alive   client  1.10.1  3         he-eu-hel1  <default>
metal-01.he-eu-hel1.nimbus.prater    65.21.73.183:8301     alive   client  1.10.1  3         he-eu-hel1  <default>
metal-02.he-eu-hel1.nimbus.prater    65.108.5.45:8301      alive   client  1.10.1  3         he-eu-hel1  <default>

Also added some docs: https://github.com/status-im/infra-role-bootstrap-macos/commit/e0332cd3
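For reference, the agent config that produces this is roughly of the following shape (a sketch; the data_dir, retry_join and client_addr values are assumptions):

{
  "datacenter": "he-eu-hel1",
  "node_name": "macos-01.ms-eu-dublin.nimbus.prater",
  "data_dir": "/usr/local/var/consul",
  "retry_join": ["node-01.he-eu-hel1.consul.hq"],
  "client_addr": "0.0.0.0"
}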

jakubgs commented 3 years ago

Good article on launchctl usage and a docs page:

jakubgs commented 3 years ago

I had to do a bunch of other fixes to the role:

jakubgs commented 3 years ago

These are useful resources for WireGuard setup on MacOS:

jakubgs commented 3 years ago

I have WireGuard pretty much working, but I have one issue. I can't ping the VPN interface locally:

admin@macos-01.ms-eu-dublin.ci.misc:~ % ping -c3 10.14.0.27
PING 10.14.0.27 (10.14.0.27): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1

--- 10.14.0.27 ping statistics ---
3 packets transmitted, 0 packets received, 100.0% packet loss

Which apparently is an issue with how wireguard-go uses the utun-type virtual interface, based on these issues:

jakubgs commented 3 years ago

We can see how the interface is created in /var/log/wireguard.log:

[#] wireguard-go utun
[+] Interface for wg0 is utun0
[#] wg setconf utun0 /dev/fd/63
[#] ifconfig utun0 inet 10.14.0.27 10.14.0.27 alias
[#] ifconfig utun0 up

And we can see that it has the 10.14.0.27 address:

 % sudo ifconfig utun0                                         
utun0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1420
    inet 10.14.0.27 --> 10.14.0.27 netmask 0xff000000 
jakubgs commented 3 years ago

As suggested in some issues, I tried adding a rule like this to /etc/pf.anchors/com.wireguard:

rdr on utun0 from any to 10.14.0.27 -> lo0
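For reference, such an anchor would typically be loaded with pfctl, roughly like this (assuming the anchor name matches the file):

 > sudo pfctl -a com.wireguard -f /etc/pf.anchors/com.wireguard
 > sudo pfctl -e    # enable pf if it is not enabled already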

But the issue persists. I might just change the healthcheck to ping something else for now, because that does work:

admin@macos-01.ms-eu-dublin.ci.misc:~ % ping -c3 10.14.0.1 
PING 10.14.0.1 (10.14.0.1): 56 data bytes
64 bytes from 10.14.0.1: icmp_seq=0 ttl=64 time=88.552 ms
64 bytes from 10.14.0.1: icmp_seq=1 ttl=64 time=43.771 ms
64 bytes from 10.14.0.1: icmp_seq=2 ttl=64 time=43.849 ms

--- 10.14.0.1 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 43.771/58.724/88.552/21.092 ms
jakubgs commented 3 years ago

This doesn't solve the ping issue, but it does make for a decent healthcheck for Consul:

 % ping -c3 -b utun0 127.0.0.1
PING 127.0.0.1 (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=0.051 ms
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.152 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.132 ms

--- 127.0.0.1 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.051/0.112/0.152/0.044 ms

Instead of pinging the WireGuard IP I ping localhost from the WireGuard virtual interface, which does verify it's up.
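A sketch of the corresponding Consul check definition, assuming script checks are enabled on the agent (field values are illustrative, and the utun name can vary):

{
  "check": {
    "id": "wireguard",
    "name": "WireGuard tunnel",
    "args": ["ping", "-c1", "-b", "utun0", "127.0.0.1"],
    "interval": "30s",
    "timeout": "5s"
  }
}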

jakubgs commented 3 years ago

The script in wireguard-tools by default creates a file at /var/run/wireguard/wg0.name with the interface name.

This is necessary because:

Since the utun driver cannot have arbitrary interface names, you must either use utun[0-9]+ for an explicit interface name or utun to have the kernel select one for you. If you choose utun as the interface name, and the environment variable WG_TUN_NAME_FILE is defined, then the actual name of the interface chosen by the kernel is written to the file specified by that variable.

https://github.com/WireGuard/wireguard-go#macos
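In practice that means the actual interface name can be read back from that file, e.g.:

 % cat /var/run/wireguard/wg0.name
utun0
 % sudo wg show "$(cat /var/run/wireguard/wg0.name)"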

jakubgs commented 3 years ago

Here's the implementation of WireGuard setup on MacOS: https://github.com/status-im/infra-role-wireguard/commit/13f56f76

jakubgs commented 3 years ago

There's something weird about nodes on MacOS, because they are not listening on the LibP2P TCP ports:

admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo nmap -Pn -p9000-9002 localhost
Host is up (0.00011s latency).
Other addresses for localhost (not scanned): ::1

PORT     STATE  SERVICE
9000/tcp closed cslistener
9001/tcp closed tor-orport
9002/tcp closed dynamid

Nmap done: 1 IP address (1 host up) scanned in 0.05 seconds

But I can see the UDP ports are being listened on:

admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo lsof -iUDP -n -P | grep nimbus
nimbus_be 84441         nimbus   16u  IPv4 0xcdf0ae1fd8ec71a9      0t0  UDP *:9000
nimbus_be 84447         nimbus   16u  IPv4 0xcdf0ae1fd8ec7789      0t0  UDP *:9001
nimbus_be 84455         nimbus   16u  IPv4 0xcdf0ae1fd8ec8059      0t0  UDP *:9002

But for TCP I can only see the RPC/REST/metrics ports open:

admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo lsof -iTCP -sTCP:LISTEN -n -P | grep nimbus
nimbus_be 84441 nimbus    3u  IPv4 0xcdf0ae1fec04f219      0t0  TCP *:9200 (LISTEN)
nimbus_be 84441 nimbus   11u  IPv4 0xcdf0ae1fec769de9      0t0  TCP 127.0.0.1:9900 (LISTEN)
nimbus_be 84441 nimbus   12u  IPv4 0xcdf0ae1fec050649      0t0  TCP 127.0.0.1:9300 (LISTEN)
nimbus_be 84447 nimbus    3u  IPv4 0xcdf0ae1febf3dde9      0t0  TCP *:9201 (LISTEN)
nimbus_be 84447 nimbus   11u  IPv4 0xcdf0ae1fecbdd061      0t0  TCP 127.0.0.1:9901 (LISTEN)
nimbus_be 84447 nimbus   12u  IPv4 0xcdf0ae1febfa9de9      0t0  TCP 127.0.0.1:9301 (LISTEN)
nimbus_be 84455 nimbus    3u  IPv4 0xcdf0ae1fecbd9de9      0t0  TCP *:9202 (LISTEN)
nimbus_be 84455 nimbus   11u  IPv4 0xcdf0ae1febd69de9      0t0  TCP 127.0.0.1:9902 (LISTEN)
nimbus_be 84455 nimbus   12u  IPv4 0xcdf0ae1fec764649      0t0  TCP 127.0.0.1:9302 (LISTEN)
jakubgs commented 3 years ago

A host reboot fixed that, which is weird:

admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo nmap -Pn -p9000-9002 localhost
Host is up (0.00013s latency).
Other addresses for localhost (not scanned): ::1

PORT     STATE SERVICE
9000/tcp open  cslistener
9001/tcp open  tor-orport
9002/tcp open  dynamid
jakubgs commented 3 years ago

One issue I identified is that if the Application Firewall is enabled, WireGuard does not accept new connections.

It can be enabled/disabled with these commands:

 > sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setglobalstate off
Firewall is disabled. (State = 0)
 > sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setglobalstate on
Firewall is enabled. (State = 1) 
 > sudo /usr/libexec/ApplicationFirewall/socketfilterfw --getglobalstate
Firewall is enabled. (State = 1)

But you can also add individual applications:

 > sudo /usr/libexec/ApplicationFirewall/socketfilterfw --add /opt/homebrew/bin/wg-quick
Application at path ( /opt/homebrew/bin/wg-quick ) added to firewall 
 > sudo /usr/libexec/ApplicationFirewall/socketfilterfw --add /opt/homebrew/bin/wireguard-go
Application at path ( /opt/homebrew/bin/wireguard-go ) added to firewall 

But as we can see, the commands resolve the symlinks and add the specific versions:

 > sudo /usr/libexec/ApplicationFirewall/socketfilterfw --listapps                          
ALF: total number of apps = 3 

1 :  /System/Library/CoreServices/RemoteManagement/ARDAgent.app 
     ( Allow incoming connections ) 

2 :  /opt/homebrew/Cellar/wireguard-tools/1.0.20210914/bin/wg-quick 
     ( Allow incoming connections ) 

3 :  /opt/homebrew/Cellar/wireguard-go/0.0.20210424/bin/wireguard-go 
     ( Allow incoming connections ) 

So this has to be repeated after every upgrade to the WireGuard packages.
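A small sketch of re-adding them after an upgrade; passing the symlink paths is enough since the firewall resolves them itself, and the SIGHUP (explained below) makes the change take effect:

 > for app in /opt/homebrew/bin/wg-quick /opt/homebrew/bin/wireguard-go; do
       sudo /usr/libexec/ApplicationFirewall/socketfilterfw --add "$app"
   done
 > sudo pkill -HUP socketfilterfw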

jakubgs commented 3 years ago

It looks like the application firewall is disabled by default on MacStadium MacOS hosts, for example on macos-02 in our CI:

administrator@macos-02.ms-eu-dublin.ci.misc:~ % sudo /usr/libexec/ApplicationFirewall/socketfilterfw --getglobalstate
Firewall is disabled. (State = 0)
administrator@macos-02.ms-eu-dublin.ci.misc:~ % sudo /usr/libexec/ApplicationFirewall/socketfilterfw --getloggingmode
Log mode is on 
administrator@macos-02.ms-eu-dublin.ci.misc:~ % sudo /usr/libexec/ApplicationFirewall/socketfilterfw --getstealthmode
Stealth mode disabled 
jakubgs commented 3 years ago

This is interesting. As opposed to Linux, on MacOS, if you run nmap on a port that's not being used, you get closed:

 > sudo nmap -Pn -p8080 macos-01.ms-eu-dublin.nimbus.prater.statusim.net
Nmap scan report for macos-01.ms-eu-dublin.nimbus.prater.statusim.net (207.254.102.130)

PORT     STATE  SERVICE
8080/tcp closed http-proxy

But if I start a netcat server on the host on port 8080:

admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo nc -l 0.0.0.0 8080   

It starts appearing as filtered:

 > sudo nmap -Pn -p8080 macos-01.ms-eu-dublin.nimbus.prater.statusim.net
Nmap scan report for macos-01.ms-eu-dublin.nimbus.prater.statusim.net (207.254.102.130)

PORT     STATE    SERVICE
8080/tcp filtered http-proxy

Which doesn't make sense, since it should appear as filtered in both cases, but okay...

jakubgs commented 3 years ago

Another weird thing. You need to send SIGHUP to the socketfilterfw process for changes to take effect:

 > sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setglobalstate off
Firewall is disabled. (State = 0)
 > sudo pkill -HUP socketfilterfw  

Otherwise the changes do not take effect. Found that out in this repo with various security articles.

jakubgs commented 3 years ago

It appears that if we want to have the application firewall enabled, then we can't use symlinks for the beacon node binaries:

admin@macos-01.ms-eu-dublin.nimbus.prater:/Library/LaunchDaemons % sudo /usr/libexec/ApplicationFirewall/socketfilterfw --listapps
ALF: total number of apps = 7 

1 :  /System/Library/CoreServices/RemoteManagement/ARDAgent.app 
     ( Allow incoming connections ) 

2 :  /Users/nimbus/beacon-node-prater-stable/repo/build/nimbus_beacon_node_f52efc0c 
     ( Allow incoming connections ) 

Because that confuses the firewall, since the process appears to use the symlink, not the binary it points to:

admin@macos-01.ms-eu-dublin.nimbus.prater:~ % ps -ax | grep prater-stable 
  143 ??         0:07.47 /Users/nimbus/beacon-node-prater-stable/repo/build/nimbus_beacon_node --network=prater --data-dir=/Users/nimbus/beacon-node-prater-stable/data/shared_prater_0 --web3-url=wss://goerli.infura.io/ws/v3/6224f3c792cc443fafb64e70a98f871e --nat=extip:207.254.102.130 --log-level=DEBUG --tcp-port=9000 --udp-port=9000 --max-peers=300 --num-threads=1 --netkey-file=/Users/nimbus/beacon-node-prater-stable/data/netkey --slashing-db-kind=v2 --insecure-netkey-password=true --subscribe-all-subnets=false --doppelganger-detection=true --rpc=true --rpc-address=127.0.0.1 --rpc-port=9900 --rest=true --rest-address=127.0.0.1 --rest-port=9300 --metrics=true --metrics-address=0.0.0.0 --metrics-port=9
jakubgs commented 3 years ago

I don't get this firewall at all. I rebooted the host and now all Libp2p ports are available:

 > sudo nmap -Pn -sT -p9000-9002 macos-01.ms-eu-dublin.nimbus.prater.statusim.net
Nmap scan report for macos-01.ms-eu-dublin.nimbus.prater.statusim.net (207.254.102.130)
Host is up (0.046s latency).

PORT     STATE SERVICE
9000/tcp open  cslistener
9001/tcp open  tor-orport
9002/tcp open  dynamid

But so are the metrics ones, which I didn't enable:

 > sudo nmap -Pn -sT -p9200-9202 macos-01.ms-eu-dublin.nimbus.prater.statusim.net
Nmap scan report for macos-01.ms-eu-dublin.nimbus.prater.statusim.net (207.254.102.130)
Host is up (0.046s latency).

PORT     STATE SERVICE
9200/tcp open  wap-wsp
9201/tcp open  wap-wsp-wtp
9202/tcp open  wap-wsp-s
jakubgs commented 3 years ago

Adding the binaries does work, but it requires a process restart to take effect:

PORT     STATE    SERVICE
9200/tcp filtered wap-wsp
 % sudo /usr/libexec/ApplicationFirewall/socketfilterfw --add /Users/nimbus/beacon-node-prater-stable/repo/build/nimbus_beacon_node_f52efc0c   
Application at path ( /Users/nimbus/beacon-node-prater-stable/repo/build/nimbus_beacon_node_f52efc0c ) added to firewall 
PORT     STATE    SERVICE
9200/tcp filtered wap-wsp
 % sudo launchctl unload status.beacon-node-prater-stable.plist
 % sudo launchctl load status.beacon-node-prater-stable.plist
PORT     STATE SERVICE
9200/tcp open  wap-wsp
jakubgs commented 3 years ago

Also, it looks like Consul is not coming up after a reboot:

admin@macos-01.ms-eu-dublin.nimbus.prater:/Library/LaunchDaemons % tail -n5 /var/log/consul/consul.log 
2021-10-07T12:17:02.770+0100 [INFO]  agent: Stopping server: address=[::]:8500 network=tcp protocol=http
2021-10-07T12:17:02.771+0100 [INFO]  agent: Waiting for endpoints to shut down
2021-10-07T12:17:02.771+0100 [INFO]  agent: Endpoints down
2021-10-07T12:17:02.771+0100 [INFO]  agent: Exit code: code=0
==> system allows a max of 256 file descriptors, but limits.http_max_conns_per_client: 500 needs at least 520
jakubgs commented 3 years ago

It appears the issue is that Consul is started before our limit.maxfiles.plist job, which increases the file descriptor limit, has run.

We can try to mitigate this by using the OtherJobEnabled launchd parameter, as suggested in this answer.
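A sketch of what that could look like in the Consul LaunchDaemon plist, assuming the file-limit job's label is limit.maxfiles:

  <key>KeepAlive</key>
  <dict>
    <!-- keep consul alive only while the file-limit job is enabled -->
    <key>OtherJobEnabled</key>
    <dict>
      <key>limit.maxfiles</key>
      <true/>
    </dict>
  </dict>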

jakubgs commented 3 years ago

Looks like just adding config to make the service restart if it did not exit successfully works: https://github.com/status-im/infra-role-bootstrap-macos/commit/19210d30

  <key>KeepAlive</key>
  <dict>
    <key>SuccessfulExit</key>
    <false/>
  </dict>

Result:

==> system allows a max of 256 file descriptors, but limits.http_max_conns_per_client: 500 needs at least 520
==> Starting Consul agent...
           Version: '1.10.1'
           Node ID: 'cba364c3-44be-e81b-1071-8c26c9baf29f'
         Node name: 'macos-01.ms-eu-dublin.nimbus.prater'
...
jakubgs commented 3 years ago

Now I'm getting a restart loop due to Nimbus thinking the UDP libp2p discovery port is in use:

{"lvl":"INF","ts":"2021-10-07 12:58:11.765+01:00","msg":"Starting discovery node","topics":"discv5","tid":11466,"file":"protocol.nim:935","node":"1b*296ec8:207.254.102.130:9000","bindAddress":{"ip":"0.0.0.0","port":9000}}
{"lvl":"FAT","ts":"2021-10-07 12:58:11.765+01:00","msg":"Failed to start discovery service. UDP port may be already in use","topics":"networking","tid":11466,"file":"eth2_network.nim:1383"}

When it clearly is NOT in use:

admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo lsof -nP -iUDP | grep 9000                                 
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % 

Appears to come from here.

jakubgs commented 3 years ago

What the hell is happening? lsof doesn't show the ports being used but netstat does:

admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo netstat -lvn | grep -E '(pid|9000)' 
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)     rhiwat shiwat    pid   epid  state    options
udp4   97354      0  *.9000                 *.*                                786896   9216    143      0 0x0100 0x00000000

But there is no such process with pid 143:

admin@macos-01.ms-eu-dublin.nimbus.prater:~ % ps -axf | grep 143
 2000  1552   543   0  1:25PM ttys000    0:00.00 grep 143
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo kill -9 143
kill: 143: No such process

what the fuck is this shit - treize

jakubgs commented 3 years ago

I tried reproducing this with the official build of 1.5.0 for amd64, but it works fine and closes the port correctly:

NOT 2021-10-07 13:41:05.063+01:00 Shutting down after having received SIGTERM topics="beacnde" tid=3862896 file=nimbus_beacon_node.nim:1406
NOT 2021-10-07 13:41:05.063+01:00 Graceful shutdown                          topics="beacnde" tid=3862896 file=nimbus_beacon_node.nim:1349
DBG 2021-10-07 13:41:05.063+01:00 Closing discovery node                     topics="discv5" tid=3862896 file=protocol.nim:965 node=fb*e105a5:207.254.102.130:9000
DBG 2021-10-07 13:41:05.063+01:00 Server was closed                          topics="libp2p tcptransport" tid=3862896 file=tcptransport.nim:218 exc="Server is already closed!"
DBG 2021-10-07 13:41:05.063+01:00 Exception in accept loop, exiting          topics="libp2p switch" tid=3862896 file=switch.nim:200 exc="Transport closed, no more connections!"
NOT 2021-10-07 13:41:05.065+01:00 Databases closed                           topics="beacnde" tid=3862896 file=nimbus_beacon_node.nim:1362
 peers: 0 ❯ finalized: 8c0ebce4:0 ❯ head: 0bcf3a26:0:30 ❯ time: 44537:21 (1425205) ❯ sync: wwwwwwwwww:0:0.0000:0.0000:00h00m (30)                                                                                                                                                               ETH: 0 

administrator@macos-01.ms-eu-dublin.ci.misc:~/Downloads/nimbus-eth2_macOS_amd64_20211007_9ee13432/build % sudo lsof -PiUDP | grep 9000

administrator@macos-01.ms-eu-dublin.ci.misc:~/Downloads/nimbus-eth2_macOS_amd64_20211007_9ee13432/build % sudo netstat -lvn | grep -E '(pid|\.900)'                          
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)     rhiwat shiwat    pid   epid  state    options  

It's possible that this issue is specific to Darwin arm64 architecture.

jakubgs commented 3 years ago

I can reproduce it fine on arm64 with the official build:

NOT 2021-10-07 13:44:39.896+01:00 Shutting down after having received SIGINT topics="beacnde" tid=5011 file=nimbus_beacon_node.nim:1396
NOT 2021-10-07 13:44:39.896+01:00 Graceful shutdown                          topics="beacnde" tid=5011 file=nimbus_beacon_node.nim:1349
NOT 2021-10-07 13:44:39.903+01:00 Databases closed                           topics="beacnde" tid=5011 file=nimbus_beacon_node.nim:1362
 peers: 1 ❯ finalized: 4d611d5b:0 ❯ head: 4d611d5b:0:0 ❯ time: 69756:29 (2232221) ❯ sync: wPwwwwwwww:1:0.0000:0.0000:106751d23h47m (0)                                                                                                                                                          ETH: 0 

admin@macos-01.ms-eu-dublin.nimbus.prater:~/nimbus-eth2_macOS_arm64_20211007_9ee13432 % sudo lsof -PiUDP | grep 900

admin@macos-01.ms-eu-dublin.nimbus.prater:~/nimbus-eth2_macOS_arm64_20211007_9ee13432 % sudo netstat -lvn | grep -E '(pid|\.900)'
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)     rhiwat shiwat    pid   epid  state    options
udp4       0      0  *.9000                 *.*                                786896   9216    624      0 0x0100 0x00000000

admin@macos-01.ms-eu-dublin.nimbus.prater:~/nimbus-eth2_macOS_arm64_20211007_9ee13432 % sudo kill -9 624
kill: 624: No such process
jakubgs commented 3 years ago

What's weird is that I can reproduce this with the 3 ports I used for the 3 nodes I configured on that host (9000-9002), but I cannot reproduce the issue on any other ports, so it seems to me like this host has been broken in some way.

I've requested a system reinstallation in a support ticket: https://portal.macstadium.com/tickets/140153

jakubgs commented 3 years ago

For future reference, some firewall related links I used:

jakubgs commented 3 years ago

Found some more issues with bootstrapping after macos-01 was reinstalled:

jakubgs commented 3 years ago

I also added Consul definitions to the MacOS beacon node role: https://github.com/status-im/infra-role-beacon-node-macos/commit/96030a0c

And bound the nimbus account to 3000 UID to not clash with other accounts: https://github.com/status-im/infra-role-beacon-node-macos/commit/df3d2cea