Closed: jakubgs closed this issue 2 years ago.
We currently use MacStadium for MacOS hosts for our CI: https://github.com/status-im/infra-ci/tree/master/modules/mac-stadium
They do provide the new Mac Mini M1 for 109 USD per month, which isn't the worst price.
Some things to keep in mind:
The priority is to get it working. We can look into other things later if necessary.
I started working on this. We've bought the Gen 5 Mac Mini with M1 CPU and it's ready to use.
Currently I'm merging the existing macos roles from infra-ci into a new infra-role-bootstrap-macos role. One thing to keep in mind is that the roles are made for macOS 11 and might need some tweaking for macOS 12.
After that the plan is to create an infra-role-beacon-node-macos role which installs and runs the beacon node. We can then use it in the infra-nimbus fleet. Overall it will be similar to the -windows roles, but with macos instead.
@arthurk, can you update this issue with your progress? I don't see infra-role-bootstrap-macos, nor do I see any commits in infra-role-beacon-node-macos. Are you committing and just not pushing? I recommend pushing smaller chunks, even if they're not functional initially.
Not much progress, I was working on the bridges for vac and the new deployment for chat2bridge
I've created the infra-role-bootstrap-macos repo (private) with the initial files. I'm testing it on the new host (the tasks were written for macOS 10 and the new host has macOS 11)
How is the progress? Are you stuck on anything?
Good progress: I've been running the process for a while and it's working well. I'm now working on distributing the validators. I'll also look a bit into setting up a firewall; it's disabled by default on the hosts, but I'll need to check how to set it up via the CLI.
The infra-role-beacon-node-macos Ansible role is almost finished. I've updated the readme with important info; the node is running, and periodic builds and log rotation work fine. I'm currently working on integrating infra-role-dist-validators to distribute the secrets/validators for each node. It's almost identical to the Linux config, so there shouldn't be any big issues with it.
There is one problem with the sudoers file: I can't get it to work for a regular user to log in as the nimbus user. It doesn't seem high prio, so I'll fix it later on.
As for our bootstrap-macos role there are many things that can be optimized since macOS is not meant for servers. There are processes like SafariBookmarkSync and ParentalControl running for which there is no need. The OS has a firewall but it will need some work to figure out how to control it via CLI/Ansible. The role works fine right now but there's definitely room for improvement.
Quick update on where things are. I've included the config for prater in infra-nimbus at https://github.com/status-im/infra-nimbus/pull/68 and ran the full playbook for a new deployment (prater, unstable) with almost no problems. Build time is 3 minutes which is nice compared to other machines.
There was one issue where Ansible ran the first build as part of the playbook run and it failed with `CMake not installed. Aborting`. After re-running the Ansible playbook it worked again. This might be an edge case for new deployments. I suspect something is wrong with the launchd config; if we can't figure out why it's happening, we can just call the `build.sh` script from Ansible instead of using the launchd module.
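The fallback of calling the build script from Ansible could be sketched as a task like this (a sketch only: the task name and the repo path are assumptions based on the role layout described in this thread):

```yaml
# Sketch: trigger the first build directly from Ansible instead of launchd.
- name: Run initial beacon node build
  become: true
  become_user: nimbus
  ansible.builtin.command: ./build.sh
  args:
    chdir: '/Users/nimbus/beacon-node-prater-stable/repo'
```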
Otherwise, some small issues I came across:

- A `Not a valid login` error. Turns out it's because a `field=` param was missing in the lookup; the error message is confusing. I've updated the readme and added more info at https://github.com/status-im/infra-role-beacon-node-macos
- Running the playbook with tags (`ansible-playbook ansible/prater.yml --tags infra-role-beacon-node-macos`) doesn't do anything, because tags have to be explicitly defined in the playbook, and we only include `beacon-node`:

https://github.com/status-im/infra-nimbus/blob/4f05e2f40dddd17ecbe4ea630b357b44f0bffd82/ansible/prater.yml#L44-L45
https://github.com/status-im/infra-nimbus/blob/4f05e2f40dddd17ecbe4ea630b357b44f0bffd82/ansible/prater.yml#L68-L69
Still having a problem with `CMake not installed. Aborting.` during deployment when running the build script. The problem is that the PATH is wrong when Ansible runs the task:

```shell
# ssh on the server as the nimbus user
$ echo $PATH
/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/homebrew/bin:/Library/Apple/usr/bin

# ansible "echo $PATH" command
/usr/bin:/bin:/usr/sbin:/sbin
```

`cmake` is in `/opt/homebrew/bin`. Will debug this more on Monday.
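The failure mode can be reproduced portably without launchd: with the minimal PATH that launchd jobs get, a binary living outside those directories is simply not found, and prepending its directory fixes it. This is a sketch; `nimbus-fake-tool` is a made-up stand-in for Homebrew's `cmake`:

```shell
# Create a fake tool in a temp dir, mimicking cmake in /opt/homebrew/bin.
fakebin="$(mktemp -d)"
printf '#!/bin/sh\necho ok\n' > "$fakebin/nimbus-fake-tool"
chmod +x "$fakebin/nimbus-fake-tool"

# launchd's minimal PATH: the tool is not found.
(PATH="/usr/bin:/bin:/usr/sbin:/sbin"; command -v nimbus-fake-tool >/dev/null 2>&1) \
  || echo "missing with minimal PATH"

# With the extra dir prepended (as on an interactive login): found.
(PATH="$fakebin:/usr/bin:/bin:/usr/sbin:/sbin"; command -v nimbus-fake-tool >/dev/null 2>&1) \
  && echo "found with extended PATH"
```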
I've checked the problem again and currently don't know what the issue could be.
Right after running the Ansible playbook I'm starting the build via launchd:

```shell
sudo launchctl start status.beacon-node-prater-stable-build
```

which leads to the `CMake not installed. Aborting.` error. That makes sense: printing the PATH shows

```
/usr/bin:/bin:/usr/sbin:/sbin
```
When I manually launch the build script as the nimbus user (`./build.sh`), it works as expected, since that user has `/opt/homebrew/bin` in the PATH. After that I start the same `launchctl` job as above and it suddenly works. When I print the path it still shows `PATH=/usr/bin:/bin:/usr/sbin:/sbin`, but it can find CMake and does the build successfully. So now I'm trying to figure out why the launchd job works after the build.sh script has been run, but not before.
The first build in a freshly cloned repo will pull all submodules and build the Nim compiler, which fails since CMake is not in the PATH. But after the build is triggered manually by the nimbus user (with CMake in the PATH), subsequent builds don't require CMake anymore, as the already-built Nim compiler will not be rebuilt. So what I learned here is that building nimbus-eth2 doesn't actually require CMake; only building the Nim compiler does, and it is compiled once and not rebuilt as part of the whole beacon node build process.
Sourcing `/etc/profile` before starting the build sets the correct PATH and makes the build work correctly on the first run.
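A minimal demonstration of why sourcing the profile helps (paths illustrative): start from launchd's bare PATH, then source a profile snippet that prepends the Homebrew bin dir, which is the effect `/etc/profile` (via `path_helper`) has on the macOS host:

```shell
# Fake profile snippet standing in for /etc/profile + path_helper on macOS.
profile="$(mktemp)"
echo 'PATH="/opt/homebrew/bin:$PATH"; export PATH' > "$profile"

# Emulate launchd's minimal environment, then source the profile.
PATH="/usr/bin:/bin:/usr/sbin:/sbin"
. "$profile"
echo "$PATH"
# -> /opt/homebrew/bin:/usr/bin:/bin:/usr/sbin:/sbin
```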
I saw this change: https://github.com/status-im/infra-nimbus/commit/92cfc833e in which you changed the validator layout. I just want to make sure you understand what you are doing: you cannot deploy the same validators to two or more different hosts, or they will get slashed.
The validators you listed there are currently being used by `stable-large-01.aws-eu-central-1a.nimbus.prater`:

https://github.com/status-im/infra-nimbus/blob/d7e0530d978acc5df16ac2e454f38792893efac4/ansible/group_vars/nimbus.prater.yml#L23

If you deploy that change along with the validators, they will get slashed and will not work anymore.
Also, I could not find the IP of the new macOS host in your branch, nor could I access it:

```shell
> ssh admin@207.254.102.130
(admin@207.254.102.130) Password:
```
I took a look at the bootstrap role and for some reason it doesn't include me: https://github.com/status-im/infra-role-bootstrap-macos/blob/24b32c3681efee40129542e0809281d71059ba09/defaults/main.yml#L15-L16

Also, the variables do not follow the `bootstrap__` naming pattern.
> I saw this change: 92cfc83 in which you were changing the validator layout and I just want to make sure you understand what you are doing, and that you cannot deploy the same validators to two or more different hosts or they will get slashed.
> The validators you listed there are currently being used by `stable-large-01.aws-eu-central-1a.nimbus.prater`. If you deploy that change along with the validators they will get slashed and will not work anymore.
I was only using this for testing. The PR is still in "draft" mode and has a "wip" in the title, no need to review it now
Add a user for me like in other bootstrap roles. I want to take a look.
I've added a user for you in https://github.com/status-im/infra-role-bootstrap-macos/commit/6ffc9ed5e22e4a44eb25c10d46bc7a4620f80f8d
Now that https://github.com/status-im/infra-nimbus/pull/68 is merged we'll need two more things to get metrics:
Since you have only one week left I'd like you to work at least on the Consul agent config in the bootstrap role, similar to how it's done in the Linux and Windows ones.
I've fixed the MacOS PR for Consul agent service: https://github.com/status-im/infra-role-bootstrap-macos/pull/1
And deployed the change adjusting the Consul data center to be `he-eu-hel1`: https://github.com/status-im/infra-nimbus/commit/67c7eff4
```shell
admin@node-01.he-eu-hel1.consul.hq:~ % consul members | grep prater
macos-01.ms-eu-dublin.nimbus.prater  207.254.102.130:8301  alive  client  1.10.1  3  he-eu-hel1  <default>
metal-01.he-eu-hel1.nimbus.prater    65.21.73.183:8301     alive  client  1.10.1  3  he-eu-hel1  <default>
metal-02.he-eu-hel1.nimbus.prater    65.108.5.45:8301      alive  client  1.10.1  3  he-eu-hel1  <default>
```
Also added some docs: https://github.com/status-im/infra-role-bootstrap-macos/commit/e0332cd3
Good article on `launchctl` usage and a docs page:
I had to do a bunch of other fixes to the role:
These are useful resources for WireGuard setup on MacOS:
I have WireGuard pretty much working, but I have one issue. I can't ping the VPN interface locally:
```shell
admin@macos-01.ms-eu-dublin.ci.misc:~ % ping -c3 10.14.0.27
PING 10.14.0.27 (10.14.0.27): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
--- 10.14.0.27 ping statistics ---
3 packets transmitted, 0 packets received, 100.0% packet loss
```
Which apparently is an issue with how wireguard-go uses the `utun` type virtual interface, based on these issues:
We can see how the interface is created in `/var/log/wireguard.log`:

```
[#] wireguard-go utun
[+] Interface for wg0 is utun0
[#] wg setconf utun0 /dev/fd/63
[#] ifconfig utun0 inet 10.14.0.27 10.14.0.27 alias
[#] ifconfig utun0 up
```
And we can see that it has the `10.14.0.27` address:

```shell
% sudo ifconfig utun0
utun0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1420
	inet 10.14.0.27 --> 10.14.0.27 netmask 0xff000000
```
As suggested in some of those issues, I tried adding a rule like this to `/etc/pf.anchors/com.wireguard`:

```
rdr on utun0 from any to 10.14.0.27 -> lo0
```
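For completeness, an anchor file has no effect until pf references and loads it. A sketch of the `/etc/pf.conf` lines that would wire it up, assuming the anchor name matches the file above (pf also needs `sudo pfctl -f /etc/pf.conf` and `sudo pfctl -e` to pick it up):

```
rdr-anchor "com.wireguard"
load anchor "com.wireguard" from "/etc/pf.anchors/com.wireguard"
```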
But the issue persists. I might just change the healthcheck to ping something else for now, because that does work:

```shell
admin@macos-01.ms-eu-dublin.ci.misc:~ % ping -c3 10.14.0.1
PING 10.14.0.1 (10.14.0.1): 56 data bytes
64 bytes from 10.14.0.1: icmp_seq=0 ttl=64 time=88.552 ms
64 bytes from 10.14.0.1: icmp_seq=1 ttl=64 time=43.771 ms
64 bytes from 10.14.0.1: icmp_seq=2 ttl=64 time=43.849 ms
--- 10.14.0.1 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 43.771/58.724/88.552/21.092 ms
```
This doesn't solve the ping issue, but it does make for a decent healthcheck for Consul:

```shell
% ping -c3 -b utun0 127.0.0.1
PING 127.0.0.1 (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=0.051 ms
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.152 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.132 ms
--- 127.0.0.1 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.051/0.112/0.152/0.044 ms
```

Instead of pinging the WireGuard IP, I ping localhost bound to the WireGuard virtual interface, which does verify that it's up.
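A hedged sketch of what that Consul script check could look like (the check `id`, `name`, and intervals are made up for illustration; script checks also require `enable_local_script_checks` to be set in the agent config):

```json
{
  "check": {
    "id": "wireguard-up",
    "name": "WireGuard interface up",
    "args": ["ping", "-c1", "-b", "utun0", "127.0.0.1"],
    "interval": "30s",
    "timeout": "5s"
  }
}
```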
The script in `wireguard-tools` by default creates a file at `/var/run/wireguard/wg0.name` containing the interface name.
This is necessary because:

> Since the utun driver cannot have arbitrary interface names, you must either use `utun[0-9]+` for an explicit interface name or `utun` to have the kernel select one for you. If you choose `utun` as the interface name, and the environment variable `WG_TUN_NAME_FILE` is defined, then the actual name of the interface chosen by the kernel is written to the file specified by that variable.
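A small sketch of how tooling resolves the real interface behind "wg0": wg-quick writes the kernel-chosen utun name into the name file, and we read it back. A temp dir stands in for `/var/run/wireguard` here so the snippet is self-contained:

```shell
# Stand-in for /var/run/wireguard; wg0.name is normally written by
# wg-quick via the WG_TUN_NAME_FILE mechanism quoted above.
run_dir="$(mktemp -d)"
echo "utun0" > "$run_dir/wg0.name"

# Resolve the logical name "wg0" to the actual kernel interface.
real_if="$(cat "$run_dir/wg0.name")"
echo "wg0 is $real_if"
# -> wg0 is utun0
```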
Here's the implementation of WireGuard setup on MacOS: https://github.com/status-im/infra-role-wireguard/commit/13f56f76
There's something weird about nodes on MacOS: they are not listening on the LibP2P TCP ports:

```shell
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo nmap -Pn -p9000-9002 localhost
Host is up (0.00011s latency).
Other addresses for localhost (not scanned): ::1
PORT     STATE  SERVICE
9000/tcp closed cslistener
9001/tcp closed tor-orport
9002/tcp closed dynamid
Nmap done: 1 IP address (1 host up) scanned in 0.05 seconds
```
But I can see the UDP ports are being listened on:

```shell
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo lsof -iUDP -n -P | grep nimbus
nimbus_be 84441 nimbus 16u IPv4 0xcdf0ae1fd8ec71a9 0t0 UDP *:9000
nimbus_be 84447 nimbus 16u IPv4 0xcdf0ae1fd8ec7789 0t0 UDP *:9001
nimbus_be 84455 nimbus 16u IPv4 0xcdf0ae1fd8ec8059 0t0 UDP *:9002
```
But for TCP I can only see the RPC/REST/metrics ports open:

```shell
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo lsof -iTCP -sTCP:LISTEN -n -P | grep nimbus
nimbus_be 84441 nimbus  3u IPv4 0xcdf0ae1fec04f219 0t0 TCP *:9200 (LISTEN)
nimbus_be 84441 nimbus 11u IPv4 0xcdf0ae1fec769de9 0t0 TCP 127.0.0.1:9900 (LISTEN)
nimbus_be 84441 nimbus 12u IPv4 0xcdf0ae1fec050649 0t0 TCP 127.0.0.1:9300 (LISTEN)
nimbus_be 84447 nimbus  3u IPv4 0xcdf0ae1febf3dde9 0t0 TCP *:9201 (LISTEN)
nimbus_be 84447 nimbus 11u IPv4 0xcdf0ae1fecbdd061 0t0 TCP 127.0.0.1:9901 (LISTEN)
nimbus_be 84447 nimbus 12u IPv4 0xcdf0ae1febfa9de9 0t0 TCP 127.0.0.1:9301 (LISTEN)
nimbus_be 84455 nimbus  3u IPv4 0xcdf0ae1fecbd9de9 0t0 TCP *:9202 (LISTEN)
nimbus_be 84455 nimbus 11u IPv4 0xcdf0ae1febd69de9 0t0 TCP 127.0.0.1:9902 (LISTEN)
nimbus_be 84455 nimbus 12u IPv4 0xcdf0ae1fec764649 0t0 TCP 127.0.0.1:9302 (LISTEN)
```
A host reboot fixed that, which is weird:

```shell
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo nmap -Pn -p9000-9002 localhost
Host is up (0.00013s latency).
Other addresses for localhost (not scanned): ::1
PORT     STATE SERVICE
9000/tcp open  cslistener
9001/tcp open  tor-orport
9002/tcp open  dynamid
```
One issue I identified is that if the Application Firewall is enabled, WireGuard does not accept new connections.
It can be enabled/disabled with these commands:

```shell
> sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setglobalstate off
Firewall is disabled. (State = 0)
> sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setglobalstate on
Firewall is enabled. (State = 1)
> sudo /usr/libexec/ApplicationFirewall/socketfilterfw --getglobalstate
Firewall is enabled. (State = 1)
```
But you can also add individual applications:

```shell
> sudo /usr/libexec/ApplicationFirewall/socketfilterfw --add /opt/homebrew/bin/wg-quick
Application at path ( /opt/homebrew/bin/wg-quick ) added to firewall
> sudo /usr/libexec/ApplicationFirewall/socketfilterfw --add /opt/homebrew/bin/wireguard-go
Application at path ( /opt/homebrew/bin/wireguard-go ) added to firewall
```
But as we can see, the commands resolve the symlinks and add specific versions:

```shell
> sudo /usr/libexec/ApplicationFirewall/socketfilterfw --listapps
ALF: total number of apps = 3
1 :  /System/Library/CoreServices/RemoteManagement/ARDAgent.app
 	 ( Allow incoming connections )
2 :  /opt/homebrew/Cellar/wireguard-tools/1.0.20210914/bin/wg-quick
 	 ( Allow incoming connections )
3 :  /opt/homebrew/Cellar/wireguard-go/0.0.20210424/bin/wireguard-go
 	 ( Allow incoming connections )
```
So this has to be repeated after every upgrade of the WireGuard packages.
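The reason the entries go stale can be shown with plain symlink resolution: the firewall stores the fully resolved Cellar path, which changes on every version bump. This sketch mimics Homebrew's layout in a temp dir (version number taken from the listing above) and resolves the symlink the same way:

```shell
# Fake Homebrew prefix with a versioned Cellar dir and an opt-style symlink.
prefix="$(mktemp -d)"
mkdir -p "$prefix/Cellar/wireguard-go/0.0.20210424/bin"
touch "$prefix/Cellar/wireguard-go/0.0.20210424/bin/wireguard-go"
ln -s "$prefix/Cellar/wireguard-go/0.0.20210424/bin/wireguard-go" "$prefix/wireguard-go"

# This is the versioned path socketfilterfw would record; after an
# upgrade the symlink points elsewhere and the rule no longer matches.
readlink -f "$prefix/wireguard-go"
```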
It looks like the Application Firewall is disabled by default on MacStadium MacOS hosts, for example `macos-02` in our CI:

```shell
administrator@macos-02.ms-eu-dublin.ci.misc:~ % sudo /usr/libexec/ApplicationFirewall/socketfilterfw --getglobalstate
Firewall is disabled. (State = 0)
administrator@macos-02.ms-eu-dublin.ci.misc:~ % sudo /usr/libexec/ApplicationFirewall/socketfilterfw --getloggingmode
Log mode is on
administrator@macos-02.ms-eu-dublin.ci.misc:~ % sudo /usr/libexec/ApplicationFirewall/socketfilterfw --getstealthmode
Stealth mode disabled
```
This is interesting. As opposed to Linux, on MacOS if you run `nmap` on a port that's not being used you get `closed`:

```shell
> sudo nmap -Pn -p8080 macos-01.ms-eu-dublin.nimbus.prater.statusim.net
Nmap scan report for macos-01.ms-eu-dublin.nimbus.prater.statusim.net (207.254.102.130)
PORT     STATE  SERVICE
8080/tcp closed http-proxy
```
But if I start a `netcat` server on the host on port `8080`:

```shell
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo nc -l 0.0.0.0 8080
```
It starts appearing as `filtered`:

```shell
> sudo nmap -Pn -p8080 macos-01.ms-eu-dublin.nimbus.prater.statusim.net
Nmap scan report for macos-01.ms-eu-dublin.nimbus.prater.statusim.net (207.254.102.130)
PORT     STATE    SERVICE
8080/tcp filtered http-proxy
```

Which doesn't make sense, since it should appear as `filtered` in both cases, but okay...
Another weird thing: you need to send `SIGHUP` to the `socketfilterfw` process for changes to take effect:

```shell
> sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setglobalstate off
Firewall is disabled. (State = 0)
> sudo pkill -HUP socketfilterfw
```

Otherwise the changes do not take effect. Found that out in this repo with various security articles.
It appears that if we want to have the Application Firewall enabled, then we can't use symlinks for beacon node binaries:

```shell
admin@macos-01.ms-eu-dublin.nimbus.prater:/Library/LaunchDaemons % sudo /usr/libexec/ApplicationFirewall/socketfilterfw --listapps
ALF: total number of apps = 7
1 :  /System/Library/CoreServices/RemoteManagement/ARDAgent.app
 	 ( Allow incoming connections )
2 :  /Users/nimbus/beacon-node-prater-stable/repo/build/nimbus_beacon_node_f52efc0c
 	 ( Allow incoming connections )
```

Because that confuses the firewall, since the process appears to use the symlink, not the binary it points to:

```shell
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % ps -ax | grep prater-stable
  143 ??  0:07.47 /Users/nimbus/beacon-node-prater-stable/repo/build/nimbus_beacon_node --network=prater --data-dir=/Users/nimbus/beacon-node-prater-stable/data/shared_prater_0 --web3-url=wss://goerli.infura.io/ws/v3/6224f3c792cc443fafb64e70a98f871e --nat=extip:207.254.102.130 --log-level=DEBUG --tcp-port=9000 --udp-port=9000 --max-peers=300 --num-threads=1 --netkey-file=/Users/nimbus/beacon-node-prater-stable/data/netkey --slashing-db-kind=v2 --insecure-netkey-password=true --subscribe-all-subnets=false --doppelganger-detection=true --rpc=true --rpc-address=127.0.0.1 --rpc-port=9900 --rest=true --rest-address=127.0.0.1 --rest-port=9300 --metrics=true --metrics-address=0.0.0.0 --metrics-port=9
```
I don't get this firewall at all. I rebooted the host and now all LibP2P ports are available:

```shell
> sudo nmap -Pn -sT -p9000-9002 macos-01.ms-eu-dublin.nimbus.prater.statusim.net
Nmap scan report for macos-01.ms-eu-dublin.nimbus.prater.statusim.net (207.254.102.130)
Host is up (0.046s latency).
PORT     STATE SERVICE
9000/tcp open  cslistener
9001/tcp open  tor-orport
9002/tcp open  dynamid
```
But so are the metrics ones, which I didn't enable:

```shell
> sudo nmap -Pn -sT -p9200-9202 macos-01.ms-eu-dublin.nimbus.prater.statusim.net
Nmap scan report for macos-01.ms-eu-dublin.nimbus.prater.statusim.net (207.254.102.130)
Host is up (0.046s latency).
PORT     STATE SERVICE
9200/tcp open  wap-wsp
9201/tcp open  wap-wsp-wtp
9202/tcp open  wap-wsp-s
```
Adding the binaries does work, but it requires a process restart to take effect:

```shell
PORT     STATE    SERVICE
9200/tcp filtered wap-wsp

% sudo /usr/libexec/ApplicationFirewall/socketfilterfw --add /Users/nimbus/beacon-node-prater-stable/repo/build/nimbus_beacon_node_f52efc0c
Application at path ( /Users/nimbus/beacon-node-prater-stable/repo/build/nimbus_beacon_node_f52efc0c ) added to firewall

PORT     STATE    SERVICE
9200/tcp filtered wap-wsp

% sudo launchctl unload status.beacon-node-prater-stable.plist
% sudo launchctl load status.beacon-node-prater-stable.plist

PORT     STATE SERVICE
9200/tcp open  wap-wsp
```
Also, it looks like Consul is not coming up after a reboot:

```shell
admin@macos-01.ms-eu-dublin.nimbus.prater:/Library/LaunchDaemons % tail -n5 /var/log/consul/consul.log
2021-10-07T12:17:02.770+0100 [INFO]  agent: Stopping server: address=[::]:8500 network=tcp protocol=http
2021-10-07T12:17:02.771+0100 [INFO]  agent: Waiting for endpoints to shut down
2021-10-07T12:17:02.771+0100 [INFO]  agent: Endpoints down
2021-10-07T12:17:02.771+0100 [INFO]  agent: Exit code: code=0
==> system allows a max of 256 file descriptors, but limits.http_max_conns_per_client: 500 needs at least 520
```
It appears the issue is that `consul` is being started before our `limit.maxfiles.plist` command runs to increase the file descriptor limit. We can try to mitigate this by using the `OtherJobEnabled` launchd parameter, as suggested in this answer.
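A sketch of what that could look like in the Consul job's plist, assuming the file-limit job's label is `limit.maxfiles` (the exact label on our hosts may differ):

```xml
<key>KeepAlive</key>
<dict>
  <!-- Only keep Consul alive once the maxfiles job is loaded -->
  <key>OtherJobEnabled</key>
  <dict>
    <key>limit.maxfiles</key>
    <true/>
  </dict>
</dict>
```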
Looks like just adding config to make the service restart on failure works: https://github.com/status-im/infra-role-bootstrap-macos/commit/19210d30

```xml
<key>KeepAlive</key>
<dict>
  <key>SuccessfulExit</key>
  <false/>
</dict>
```
Result:

```
==> system allows a max of 256 file descriptors, but limits.http_max_conns_per_client: 500 needs at least 520
==> Starting Consul agent...
           Version: '1.10.1'
           Node ID: 'cba364c3-44be-e81b-1071-8c26c9baf29f'
         Node name: 'macos-01.ms-eu-dublin.nimbus.prater'
...
```
Now I'm getting a restart loop due to Nimbus thinking the UDP LibP2P discovery port is in use:

```
{"lvl":"INF","ts":"2021-10-07 12:58:11.765+01:00","msg":"Starting discovery node","topics":"discv5","tid":11466,"file":"protocol.nim:935","node":"1b*296ec8:207.254.102.130:9000","bindAddress":{"ip":"0.0.0.0","port":9000}}
{"lvl":"FAT","ts":"2021-10-07 12:58:11.765+01:00","msg":"Failed to start discovery service. UDP port may be already in use","topics":"networking","tid":11466,"file":"eth2_network.nim:1383"}
```
When it clearly is NOT in use:

```shell
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo lsof -nP -iUDP | grep 9000
admin@macos-01.ms-eu-dublin.nimbus.prater:~ %
```
Appears to come from here.
What the hell is happening? `lsof` doesn't show the ports being used, but `netstat` does:

```shell
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo netstat -lvn | grep -E '(pid|9000)'
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)     rhiwat shiwat    pid   epid  state    options
udp4   97354      0  *.9000                 *.*                                786896   9216    143      0 0x0100 0x00000000
```
But there is no such process with PID `143`:

```shell
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % ps -axf | grep 143
 2000  1552   543   0  1:25PM ttys000    0:00.00 grep 143
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo kill -9 143
kill: 143: No such process
```
I tried reproducing this with the official build of `1.5.0` for `amd64`, but it works fine and closes the port correctly:

```
NOT 2021-10-07 13:41:05.063+01:00 Shutting down after having received SIGTERM topics="beacnde" tid=3862896 file=nimbus_beacon_node.nim:1406
NOT 2021-10-07 13:41:05.063+01:00 Graceful shutdown topics="beacnde" tid=3862896 file=nimbus_beacon_node.nim:1349
DBG 2021-10-07 13:41:05.063+01:00 Closing discovery node topics="discv5" tid=3862896 file=protocol.nim:965 node=fb*e105a5:207.254.102.130:9000
DBG 2021-10-07 13:41:05.063+01:00 Server was closed topics="libp2p tcptransport" tid=3862896 file=tcptransport.nim:218 exc="Server is already closed!"
DBG 2021-10-07 13:41:05.063+01:00 Exception in accept loop, exiting topics="libp2p switch" tid=3862896 file=switch.nim:200 exc="Transport closed, no more connections!"
NOT 2021-10-07 13:41:05.065+01:00 Databases closed topics="beacnde" tid=3862896 file=nimbus_beacon_node.nim:1362
peers: 0 ❯ finalized: 8c0ebce4:0 ❯ head: 0bcf3a26:0:30 ❯ time: 44537:21 (1425205) ❯ sync: wwwwwwwwww:0:0.0000:0.0000:00h00m (30) ETH: 0
```

```shell
administrator@macos-01.ms-eu-dublin.ci.misc:~/Downloads/nimbus-eth2_macOS_amd64_20211007_9ee13432/build % sudo lsof -PiUDP | grep 9000
administrator@macos-01.ms-eu-dublin.ci.misc:~/Downloads/nimbus-eth2_macOS_amd64_20211007_9ee13432/build % sudo netstat -lvn | grep -E '(pid|\.900)'
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)     rhiwat shiwat    pid   epid  state    options
```
It's possible that this issue is specific to the Darwin `arm64` architecture. I can reproduce it fine on `arm64` with the official build:
```
NOT 2021-10-07 13:44:39.896+01:00 Shutting down after having received SIGINT topics="beacnde" tid=5011 file=nimbus_beacon_node.nim:1396
NOT 2021-10-07 13:44:39.896+01:00 Graceful shutdown topics="beacnde" tid=5011 file=nimbus_beacon_node.nim:1349
NOT 2021-10-07 13:44:39.903+01:00 Databases closed topics="beacnde" tid=5011 file=nimbus_beacon_node.nim:1362
peers: 1 ❯ finalized: 4d611d5b:0 ❯ head: 4d611d5b:0:0 ❯ time: 69756:29 (2232221) ❯ sync: wPwwwwwwww:1:0.0000:0.0000:106751d23h47m (0) ETH: 0
```

```shell
admin@macos-01.ms-eu-dublin.nimbus.prater:~/nimbus-eth2_macOS_arm64_20211007_9ee13432 % sudo lsof -PiUDP | grep 900
admin@macos-01.ms-eu-dublin.nimbus.prater:~/nimbus-eth2_macOS_arm64_20211007_9ee13432 % sudo netstat -lvn | grep -E '(pid|\.900)'
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)     rhiwat shiwat    pid   epid  state    options
udp4       0      0  *.9000                 *.*                                786896   9216    624      0 0x0100 0x00000000
admin@macos-01.ms-eu-dublin.nimbus.prater:~/nimbus-eth2_macOS_arm64_20211007_9ee13432 % sudo kill -9 624
kill: 624: No such process
```
What's weird is that I can reproduce this with the 3 ports I used for the 3 nodes I configured on that host (`9000-9002`), but I cannot reproduce the issue on any other ports. So it seems to me like this host has been broken in some way.
I've requested a system reinstallation in a support ticket: https://portal.macstadium.com/tickets/140153
For future reference, some firewall related links I used:
Found some more issues with bootstrapping after `macos-01` was reinstalled:
I also added consul definitions to MacOS beacon node role: https://github.com/status-im/infra-role-beacon-node-macos/commit/96030a0c
And bound the `nimbus` account to UID `3000` to not clash with other accounts: https://github.com/status-im/infra-role-beacon-node-macos/commit/df3d2cea
We need a MacOS host for Prater testnet nodes. The minimum hardware requirements:

It will run 3 instances of `infra-role-beacon-node` connected to the Prater testnet. Each instance will run a build from a different branch (`unstable`, `testing`, `stable`). The nodes will take over validators of the current Prater testnet nodes with the `04` index (e.g. `stable-04`, `testing-04`, etc.). It should also build the newest version of the respective branch daily.

Full Details: https://github.com/status-im/infra-nimbus/issues/58