jakubgs closed this issue 2 years ago.
This is a good lecture on the MacOS packet filter (pf) firewall: https://www.youtube.com/watch?v=SOWQCAA8lZA
I found the slides: https://macadmins.psu.edu/files/2017/07/psumac2017-148-Packet-Filtering-Mac-OSX-Under-the-Hood-with-Apple-PF-2c3abhl.pdf
And his firewall setup: https://github.com/jhimes/PF-setup
I re-installed the host and re-deployed the nodes, and so far the original issue has not reappeared, but I'm seeing something new.
The metrics endpoint for the stable node stopped responding during the night:
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % curl -sv localhost:9200/health
* Trying ::1...
* TCP_NODELAY set
* Connection failed
* connect to ::1 port 9200 failed: Connection refused
* Trying 127.0.0.1...
* TCP_NODELAY set
But when I call the REST API endpoint for the same node it works fine:
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % curl -sS localhost:9300/eth/v1/node/version
{"data":{"version":"Nimbus/v1.5.0-f52efc-stateofus"}}
I checked the process ports and they are all open, including 9200 for metrics:
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo lsof -iTCP -sTCP:LISTEN -n -P | grep 49291
nimbus_be 49291 nimbus 3u IPv4 0x12e16c957846c00b 0t0 TCP *:9200 (LISTEN)
nimbus_be 49291 nimbus 11u IPv4 0x12e16c9578466bdb 0t0 TCP 127.0.0.1:9900 (LISTEN)
nimbus_be 49291 nimbus 12u IPv4 0x12e16c9578464d93 0t0 TCP 127.0.0.1:9300 (LISTEN)
nimbus_be 49291 nimbus 17u IPv4 0x12e16c95784597ab 0t0 TCP *:9000 (LISTEN)
Which makes no sense... why is MacOS such a trash OS?
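Not part of the original debugging, just a hedged sketch: since curl tries ::1 first and only then falls back to 127.0.0.1, one way to compare the two loopback addresses against the metrics port (9200, from the lsof output above) is a small Python probe:

```python
import socket

def check_port(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Compare IPv4 and IPv6 loopback, since curl attempted ::1 first
# and only then fell back to 127.0.0.1:
for host in ("127.0.0.1", "::1"):
    print(host, check_port(host, 9200))
```

If the IPv4 probe succeeds while the IPv6 one is refused, the listener is IPv4-only (consistent with the `IPv4` column in the lsof output) and the curl failure on ::1 is expected noise.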
And I don't see any informative or related errors at the time the alert first showed up: 2021-10-08T02:23:23Z
{"lvl":"DBG","ts":"2021-10-08 02:20:01.075+01:00","msg":"Error processing request","topics":"networking","tid":322828,"file":"eth2_network.nim:539","peer":"16U*5kxHjx","responseCode":1,"errMsg":"Failed to decompress snappy payload"}
{"lvl":"DBG","ts":"2021-10-08 02:20:01.075+01:00","msg":"Error processing an incoming request","topics":"sync","tid":322828,"file":"eth2_network.nim:787","err":"Stream Closed!","msgName":"pingObj"}
{"lvl":"DBG","ts":"2021-10-08 02:20:11.003+01:00","msg":"Error processing request","topics":"networking","tid":322828,"file":"eth2_network.nim:539","peer":"16U*ZiFi3u","responseCode":1,"errMsg":"Incomplete request"}
{"lvl":"DBG","ts":"2021-10-08 02:20:11.003+01:00","msg":"Error processing an incoming request","topics":"sync","tid":322828,"file":"eth2_network.nim:787","err":"Stream Closed!","msgName":"statusObj"}
{"lvl":"DBG","ts":"2021-10-08 02:21:04.862+01:00","msg":"A future has failed, enable trace logging for details","topics":"libp2p muxer","tid":322828,"file":"errors.nim:31","error":"TooManyConnectionsError"}
{"lvl":"DBG","ts":"2021-10-08 02:22:38.685+01:00","msg":"A future has failed, enable trace logging for details","topics":"libp2p connmanager","tid":322828,"file":"errors.nim:31","error":"LPStreamClosedError"}
{"lvl":"DBG","ts":"2021-10-08 02:22:42.899+01:00","msg":"Error processing request","topics":"networking","tid":322828,"file":"eth2_network.nim:539","peer":"16U*2B5mj7","responseCode":1,"errMsg":"Incomplete request"}
{"lvl":"DBG","ts":"2021-10-08 02:22:42.899+01:00","msg":"Error processing an incoming request","topics":"sync","tid":322828,"file":"eth2_network.nim:787","err":"Stream Closed!","msgName":"pingObj"}
{"lvl":"DBG","ts":"2021-10-08 02:23:14.514+01:00","msg":"Error processing request","topics":"networking","tid":322828,"file":"eth2_network.nim:539","peer":"16U*2B5mj7","responseCode":1,"errMsg":"Failed to decompress snappy payload"}
{"lvl":"DBG","ts":"2021-10-08 02:23:14.514+01:00","msg":"Error processing an incoming request","topics":"sync","tid":322828,"file":"eth2_network.nim:787","err":"Stream Closed!","msgName":"pingObj"}
{"lvl":"DBG","ts":"2021-10-08 02:23:42.471+01:00","msg":"Error processing request","topics":"networking","tid":322828,"file":"eth2_network.nim:539","peer":"16U*2B5mj7","responseCode":1,"errMsg":"Incomplete request"}
{"lvl":"DBG","ts":"2021-10-08 02:23:42.471+01:00","msg":"Error processing an incoming request","topics":"sync","tid":322828,"file":"eth2_network.nim:787","err":"Stream Closed!","msgName":"statusObj"}
{"lvl":"DBG","ts":"2021-10-08 02:23:44.652+01:00","msg":"A future has failed, enable trace logging for details","topics":"libp2p muxer","tid":322828,"file":"errors.nim:31","error":"TooManyConnectionsError"}
{"lvl":"DBG","ts":"2021-10-08 02:24:37.651+01:00","msg":"Error processing request","topics":"networking","tid":322828,"file":"eth2_network.nim:539","peer":"16U*QfQpvZ","responseCode":1,"errMsg":"Failed to decompress snappy payload"}
{"lvl":"DBG","ts":"2021-10-08 02:24:37.651+01:00","msg":"Error processing an incoming request","topics":"sync","tid":322828,"file":"eth2_network.nim:787","err":"Stream Closed!","msgName":"pingObj"}
{"lvl":"DBG","ts":"2021-10-08 02:24:58.909+01:00","msg":"Error processing request","topics":"networking","tid":322828,"file":"eth2_network.nim:539","peer":"16U*ZiFi3u","responseCode":1,"errMsg":"Incomplete request"}
{"lvl":"DBG","ts":"2021-10-08 02:24:58.909+01:00","msg":"Error processing an incoming request","topics":"sync","tid":322828,"file":"eth2_network.nim:787","err":"Stream Closed!","msgName":"pingObj"}
{"lvl":"DBG","ts":"2021-10-08 02:25:16.223+01:00","msg":"Error processing request","topics":"networking","tid":322828,"file":"eth2_network.nim:539","peer":"16U*2YTb8s","responseCode":1,"errMsg":"Failed to decompress snappy payload"}
{"lvl":"DBG","ts":"2021-10-08 02:25:16.223+01:00","msg":"Error processing an incoming request","topics":"sync","tid":322828,"file":"eth2_network.nim:787","err":"Stream Closed!","msgName":"pingObj"}
{"lvl":"DBG","ts":"2021-10-08 02:25:53.319+01:00","msg":"A future has failed, enable trace logging for details","topics":"libp2p muxer","tid":322828,"file":"errors.nim:31","error":"TooManyConnectionsError"}
This is something worth noting:
...one of the drawbacks of modifying pf.conf directly is that macOS upgrades revert that file to its default contents (removing your custom rules). For example, in an upgrade from High Sierra (macOS 10.13.x) to Catalina (10.15.x), the following pf files were overwritten on my test Mac:
/etc/pf.conf
/etc/pf.anchors/com.apple
Custom anchors under /etc/pf.anchors/ were retained, but they were not especially useful since the references to them in pf.conf were overwritten!
https://blog.neilsabol.site/post/quickly-easily-adding-pf-packet-filter-firewall-rules-macos-osx/
Based on the information above I think it's better to create a separate /etc/firewall.conf file that will be loaded by a custom service.
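A minimal sketch of what such a service could look like — the label com.statusim.firewall and the exact pfctl invocation are hypothetical illustrations, not taken from the actual setup — is a launchd daemon in /Library/LaunchDaemons/ that enables pf and loads the ruleset at boot:

```
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <!-- Hypothetical label; the real service name may differ. -->
    <key>Label</key>
    <string>com.statusim.firewall</string>
    <!-- pfctl -e enables pf, -f loads the custom ruleset. -->
    <key>ProgramArguments</key>
    <array>
        <string>/sbin/pfctl</string>
        <string>-e</string>
        <string>-f</string>
        <string>/etc/firewall.conf</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
</dict>
</plist>
```

Since the file lives outside /etc/pf.conf and /etc/pf.anchors/, an OS upgrade reverting those defaults would not touch it.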
This is a basic ruleset that I tested and it works:
# Don't filter loopback interface.
set skip on lo0
# Allows SSH connections.
pass in quick proto tcp from any to any port 22
# Block all incoming traffic.
block in all
This allows all local traffic and external SSH access while blocking everything else incoming. Note that this also blocks incoming traffic on the WireGuard VPN.
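If incoming WireGuard traffic should be allowed, one of the following additions could be sketched into the ruleset — both the utun interface family and the UDP port 51820 are assumptions (WireGuard's default port), not taken from the actual config:

```
# Option A: skip filtering on WireGuard tunnel interfaces
# (WireGuard on MacOS uses utun devices; "set skip" belongs in the
# options section, before the filter rules).
set skip on utun

# Option B: allow incoming WireGuard handshakes on the physical
# interface (assumes the default WireGuard UDP port 51820).
pass in quick proto udp from any to any port 51820
```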
I've finished the new firewall setup:
Now all the configuration is located in /etc/firewall.conf and related files. More can be read in the readme.
Result:
> sudo nmap -Pn -p22,8301,9000-9002,9200-9202 macos-01.ms-eu-dublin.nimbus.prater.statusim.net
Nmap scan report for macos-01.ms-eu-dublin.nimbus.prater.statusim.net (207.254.102.130)
Host is up (0.048s latency).
PORT STATE SERVICE
22/tcp open ssh
8301/tcp open amberon
9000/tcp open cslistener
9001/tcp open tor-orport
9002/tcp open dynamid
9200/tcp filtered wap-wsp
9201/tcp filtered wap-wsp-wtp
9202/tcp filtered wap-wsp-s
Nmap done: 1 IP address (1 host up) scanned in 14.27 seconds
I'm beginning to think the issue is memory related, since the host has only 8GB of RAM, while usual usage on our Hetzner hosts is around 9GB, not counting cache:
But we don't have memory usage metrics for MacOS yet, since no Netdata setup was done.
I could try getting a bigger Mac mini but the next cancellation date is on 2021-10-22:
But the host costs $109, so it's not that bad: https://docs.macstadium.com/docs/how-do-i-end-or-cancel-my-subscriptio
I've deployed Netdata to the MacOS hosts:
It works:
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % curl -s 'localhost:8001/api/v1/allmetrics?format=prometheus' | head
netdata_info{instance="macos-01.ms-eu-dublin.nimbus.prater",application="netdata",version="v1.29.3"} 1 1634051353246
netdata_host_tags_info{} 1 1634051353246
netdata_host_tags{} 1 1634051353246
netdata_disk_svctm_milliseconds_operation_average{chart="disk_svctm.disk0",family="disk0",dimension="svctm"} 0.1333720 1634051335000
netdata_disk_avgsz_KiB_operation_average{chart="disk_avgsz.disk0",family="disk0",dimension="reads"} 15.8251950 1634051335000
netdata_disk_avgsz_KiB_operation_average{chart="disk_avgsz.disk0",family="disk0",dimension="writes"} -39.6865200 1634051335000
netdata_disk_await_milliseconds_operation_average{chart="disk_await.disk0",family="disk0",dimension="reads"} 0.1755850 1634051335000
netdata_disk_await_milliseconds_operation_average{chart="disk_await.disk0",family="disk0",dimension="writes"} -0.1080440 1634051335000
netdata_job_size_KB_average{chart="cups.job_size",family="overview",dimension="pending"} 0.0000000 1634051335000
netdata_job_size_KB_average{chart="cups.job_size",family="overview",dimension="held"} 0.0000000 1634051335000
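The endpoint returns plain Prometheus text format, so the lines above can be consumed without any Netdata-specific tooling. A hedged sketch of a minimal parser (the regex is my own approximation of the format, not an official grammar):

```python
import re

# name{labels} value timestamp_ms — roughly what Netdata emits above.
LINE_RE = re.compile(r'^(\w+)(\{[^}]*\})?\s+(-?[\d.]+)\s+(\d+)$')

def parse_metrics(text: str):
    """Return a list of (name, labels, value, timestamp_ms) tuples."""
    out = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            name, labels, value, ts = m.groups()
            out.append((name, labels or "", float(value), int(ts)))
    return out

sample = 'netdata_info{application="netdata",version="v1.29.3"} 1 1634051353246'
print(parse_metrics(sample))
```

This is enough to, for example, pick out memory-related series once the macOS memory charts start flowing.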
Created an issue about MacOS metrics and API being flaky: https://github.com/status-im/nimbus-eth2/issues/2984
Apparently by default MacOS does not give remote access to non-admin users: https://www.vinnie.work/blog/2020-12-26-why-so-hard-osx-ssh-access/
So they either have to be in the admin group, or be explicitly given the right to SSH into the host:
dseditgroup -o edit -n /Local/Default -u vinnie -p -a joe -t user com.apple.access_ssh
I adjusted the Ansible role to grant SSH access to users: https://github.com/status-im/infra-role-bootstrap-macos/commit/52273416
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo dscacheutil -q group -a name com.apple.access_ssh
name: com.apple.access_ssh
password: *
gid: 399
users: auto admin jakub petty zahary dustin mamy stefan dryajov kim giovanni tanguy cheatfate
But I also added Nimbus team members to the admin group to make debugging easier: https://github.com/status-im/infra-nimbus/commit/7aa5d2d4
Also, I've received a response to my ticket about inaccessible DNS servers: https://portal.macstadium.com/tickets/140395
Our DNS servers were:
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % tail -n4 /etc/resolv.conf
nameserver 207.254.72.253
nameserver 207.254.72.254
nameserver 8.8.8.8
nameserver 8.8.4.4
But apparently:
Good afternoon. Could I propose that you change the DNS servers to the Dublin MacStadium DNS servers as upon further investigation it seems your Mini was incorrectly set with the Las Vegas MacStadium DNS servers. At your discretion, you may also want to configure your preferred Public DNS as a fall-back as a third DNS server.
MacStadium Dublin DNS1: 207.254.25.253 MacStadium Dublin DNS2: 207.254.25.254
Regards, Mark
Of course Apple has to re-invent everything, so I had to modify the DNS configuration using the networksetup command:
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % networksetup -getdnsservers Ethernet
207.254.72.253
207.254.72.254
8.8.8.8
8.8.4.4
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % networksetup -setdnsservers Ethernet 207.254.25.253 207.254.25.254 8.8.8.8 8.8.4.4 1.1.1.1
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % networksetup -getdnsservers Ethernet
207.254.25.253
207.254.25.254
8.8.8.8
8.8.4.4
1.1.1.1
So now it's correct and resolves queries fast.
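A quick sanity check from code — not part of the original debugging, just a sketch that times a lookup through the system resolver:

```python
import socket
import time

def resolve(name: str):
    """Resolve a hostname via the system resolver, timing the lookup."""
    t0 = time.monotonic()
    addrs = sorted({ai[4][0] for ai in socket.getaddrinfo(name, None)})
    return addrs, time.monotonic() - t0

addrs, elapsed = resolve("statusim.net")
print(addrs, f"{elapsed * 1000:.1f} ms")
```

With the wrong (Las Vegas) DNS servers the lookup would either be slow or fail with `socket.gaierror`; with the Dublin servers it should return quickly.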
To help with debugging I got SIP disabled via MacStadium support ticket: https://portal.macstadium.com/tickets/140647
admin@macos-01.ms-eu-dublin.nimbus.prater:~ % sudo csrutil status
System Integrity Protection status: disabled.
After investigation in https://github.com/status-im/nimbus-eth2/issues/2984, and thanks to suggestions from @stefantalpalaru, it appears the issue was indeed the file limit: when set to a value higher than 65536 it is simply ignored by the OS, which is dumb beyond belief. What's even dumber is that getrlimit() keeps reporting the fake limit for some reason.
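The reported-vs-effective mismatch can be observed from any language that wraps getrlimit()/setrlimit(); here is a hedged Python sketch (the Nimbus nodes themselves obviously don't use Python — this only illustrates the syscall behaviour):

```python
import resource

# Inspect the open-files limit as the kernel reports it.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("before:", soft, hard)

# Try raising the soft limit to the hard limit. On macOS, values
# above 65536 may be accepted and echoed back by getrlimit() even
# though the effective kernel limit stays lower.
try:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
except (ValueError, OSError) as e:
    print("setrlimit failed:", e)

print("after:", resource.getrlimit(resource.RLIMIT_NOFILE))
```

The only reliable check is therefore to actually open file descriptors until the call fails, rather than trusting the reported numbers.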
Fixed in bootstrap: https://github.com/status-im/infra-role-bootstrap-macos/commit/a0ab4fb8
Looks like we'll be able to move validators to our new MacOS host next week.
I also tested changing limits in /etc/sysctl.conf, but as far as I can tell it had no effect:
kern.maxfiles=20000000
kern.maxfilesperproc=20000000
As suggested by Zah I've lowered the max peers limit to 200 to avoid hitting the open files limit: https://github.com/status-im/infra-nimbus/commit/5f04e4b1
I've migrated the validators from the AWS prater 02 nodes to the MacOS host:
It appears to be working fine:
admin@macos-01.ms-eu-dublin.nimbus.prater:/Users/nimbus % grep 'Attestation sent' beacon-node-*/logs/service.log | wc -l
7019
I'd like to get rid of all three 02 AWS nodes, but I can't, because one of them is used as a Prater bootstrap node:
https://github.com/status-im/infra-nimbus/blob/961756674c6d3e7f3b512f166a59733a1a177899/ansible/host_vars/stable-large-02.aws-eu-central-1a.nimbus.prater.yml#L1-L2
But my idea is to get rid of both testing nodes:
testing-large-01.aws-eu-central-1a.nimbus.prater
testing-large-02.aws-eu-central-1a.nimbus.prater
And then rename stable-large-02.aws-eu-central-1a.nimbus.prater to testing-large-01.aws-eu-central-1a.nimbus.prater.
This way we can keep it without changing the IP or node ID. I just have to make sure the port stays the same.
It is done: https://github.com/status-im/infra-nimbus/commit/b5684f01
Changes to Outputs:
~ hosts = {
- stable-large-02.aws-eu-central-1a.nimbus.prater = "3.65.99.236" -> null
~ testing-large-01.aws-eu-central-1a.nimbus.prater = "18.195.151.76" -> "3.65.99.236"
# (26 unchanged elements hidden)
}
Before:
> dig +short stable-large-01.aws-eu-central-1a.nimbus.prater.statusim.net
3.120.202.252
> dig +short stable-large-02.aws-eu-central-1a.nimbus.prater.statusim.net
3.65.99.236
> dig +short testing-large-01.aws-eu-central-1a.nimbus.prater.statusim.net
18.195.151.76
Now:
> dig +short stable-large-01.aws-eu-central-1a.nimbus.prater.statusim.net
3.120.202.252
> dig +short testing-large-01.aws-eu-central-1a.nimbus.prater.statusim.net
3.65.99.236
The node key and port were preserved.
I will monitor the macos-01 host over the weekend, and if everything goes well this issue can be closed.
As far as I can see all three nodes on the three main branches work fine:
I consider this task done.
We need a MacOS host for Prater testnet nodes. The minimum hardware requirements:
It will run 3 instances of infra-role-beacon-node connected to the Prater testnet. Each instance will run a build from a different branch (unstable, testing, stable). The nodes will take over validators from the current Prater testnet nodes with the 04 index (e.g. stable-04, testing-04, etc). It should also build the newest version of the respective branch daily.
Full Details: https://github.com/status-im/infra-nimbus/issues/58