jakubgs closed this issue 3 years ago
If you look at https://github.com/docker/for-linux/issues/335, there isn't any explanation either and yet it happens across many setups. To figure out what's causing it at this point would require some heavy duty digging into the OS and Docker TCP stacks - which I don't think we have the time nor the resources for, ATM.
This is also something that hasn't been widely reported by any other user, and we know for a fact that some do run docker, so quite possibly this is a combination of our cloud provider, some specific middleware, the OS, Docker and then libp2p. Which one it is beats me, and like I said, I don't think we have the resources, nor is it worth spending the time tracking it down at this time.
I suggest we look for acceptable workarounds for now and address this properly once it has been confirmed to be a wider issue (or as I suspect, docker fixes it in a future release, but so far they haven't acknowledged it as an issue, IIUC).
So let's look at our options:
v17.0.9, which seems to fix it as per the docker issues linked above.

> This is also something that hasn't been widely reported by any other user and we know for a fact that some do run docker, so quite possibly this can be a combination of our cloud provider and some specific middleware, the OS, Docker and then libp2p
Yes, I also think that's possible. Which is why this is such a pain. It's most probably a combination of a few things.
Yeah, it's a pesky issue for sure and I think right now it comes down to mitigating the cost by finding a reasonable workaround rather than trying to pinpoint what it is exactly - which can be a time sink of epic proportions.
I agree. Though my question would be: what's the harm? I mean, are we just worried that this will cause some issues in the future? Because as far as I know there's nothing wrong with these old connections lingering. Unless there's a symptom that I'm not aware of.
Simplest solution is indeed just running it directly via Systemd. But that leaves the issue of distributing new builds. We'll need a new setup for doing that.
But my question still stands. Why is this an issue, other than it's just not pretty? Is this causing any actual degradation in functionality of the node?
> I agree. Though my question would be: what's the harm? I mean, are we just worried that this will cause some issues in the future? Because as far as I know there's nothing wrong with these old connections lingering. Unless there's a symptom that I'm not aware of.
Oh, it's definitely a resource leak. This port pair isn't usable anymore, so the OS can run out of port pairs at some point. For example, when a remote establishes a connection to port 9001, the OS will track it as a port pair of the form xxxx:9001 (where xxxx is the remote's ephemeral port). If this connection never closes properly, the OS won't release that port pair, and it won't be reusable until the socket is destroyed, which means new remotes won't be able to connect anymore once the OS runs out of port pairs.
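The exhaustion risk can be eyeballed directly from /proc; a minimal sketch, assuming a Linux host (in /proc/net/tcp, state code 05 is FIN-WAIT-2), that compares the number of stuck sockets against the kernel's ephemeral port range:

```shell
# Count sockets stuck in FIN-WAIT-2 (state code 05 in /proc/net/tcp)
# and compare against the ephemeral port range the kernel can hand out.
finwait2=$(awk 'NR > 1 && $4 == "05"' /proc/net/tcp | wc -l)
read lo hi < /proc/sys/net/ipv4/ip_local_port_range
echo "FIN-WAIT-2 sockets: $finwait2 (ephemeral port range size: $((hi - lo + 1)))"
```

This only looks at IPv4 sockets; the same state codes apply to /proc/net/tcp6 as well.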
The one thing that puzzles me though is why these connections aren't killed off as per net.ipv4.tcp_fin_timeout - or does it actually do so?
Well, that's the thing. I think they do time out, but new ones just keep showing up. But I'd have to verify that.
If it does not respect the timeout, then yes, theoretically the OS could run out of port pairs. But if it does, then it's not really an issue.
> If it does not respect the timeout, then yes, theoretically the OS could run out of port pairs. But if it does, then it's not really an issue.
Yeah, exactly this - otherwise I'm fine if we just leave it be for the time being.
Let me verify that then.
Though if this is an issue that seems to be made more severe by combining with libp2p, it might be sensible to discourage users from using Docker as a way to run Nimbus until we figure out what's happening for real. Another thing to do would be to try a Docker setup without our Ansible roles/cloud provider and see if it happens as well. If it does, then it's not just our fleet that shouldn't use Docker for the beacon node.
Okay, so I verified that the host has the timeout set to 60:
admin@testing-large-01.aws-eu-central-1a.nimbus.pyrmont:~ % cat /proc/sys/net/ipv4/tcp_fin_timeout
60
And I picked a specific IP and port of one of the connections stuck in FIN-WAIT-2 and watched it for over 60 seconds:
admin@testing-large-01.aws-eu-central-1a.nimbus.pyrmont:~ % while true; do sleep 5; ss -H -4 state FIN-WAIT-2 | grep 172.17.2.1:60964; done
tcp 0 0 172.17.2.1:60964 172.17.2.2:9100
tcp 0 0 172.17.2.1:60964 172.17.2.2:9100
tcp 0 0 172.17.2.1:60964 172.17.2.2:9100
... (more than 60 seconds passed) ...
tcp 0 0 172.17.2.1:60964 172.17.2.2:9100
tcp 0 0 172.17.2.1:60964 172.17.2.2:9100
tcp 0 0 172.17.2.1:60964 172.17.2.2:9100
As far as I can tell it ignores the OS timeout for the FIN-WAIT state. Pretty crazy.
Maybe it ignores the timeout because the socket is still in use by the program.
And restarting the container does indeed release the FIN-WAIT-2 connections. So it must be because the sockets are kept open by the program.
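That would match how the kernel behaves, as far as I understand it: net.ipv4.tcp_fin_timeout only applies to orphaned sockets, i.e. ones no longer referenced by any application, so a FIN-WAIT-2 socket whose file descriptor is still held by a process never gets reaped by that timer. One way to check is to ask ss which process, if any, still owns the lingering sockets:

```shell
# List FIN-WAIT-2 sockets along with the owning process (-p needs root).
# Entries that still show a process are held open by the application,
# so tcp_fin_timeout (which only covers orphaned sockets) never fires for them.
sudo ss -H -4 -p state FIN-WAIT-2
```

If the process column is empty for these sockets, the orphan timer should apply and the mystery deepens.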
Hmm, maybe this is affected by the tcp_fin_timeout set in the container as well? This is indeed puzzling...
From https://benohead.com/blog/2013/07/21/tcp-about-fin_wait_2-time_wait-and-close_wait/
> FIN_WAIT_2
> If many sockets which were connected to a specific remote application end up stuck in this state, it usually indicates that the remote application either always dies unexpectedly when in the CLOSE_WAIT state or just fails to perform an active close after the passive close.
> The timeout for sockets in the FIN-WAIT-2 state is defined with the parameter tcp_fin_timeout. You should set it to a value high enough so that if the remote end-point is going to perform an active close, it will have time to do it. On the other hand sockets in this state do use some memory (even though not much) and this could lead to a memory overflow if too many sockets are stuck in this state for too long.
So, it should release as per tcp_fin_timeout, unless it's omitting something important or I'm reading it wrong?
The container should inherit it from the host, and it does:
admin@testing-large-01.aws-eu-central-1a.nimbus.pyrmont:~ % d exec -it beacon-node-pyrmont-testing-large bash
root@4b61aa05fcc2:/# cat /proc/sys/net/ipv4/tcp_fin_timeout
60
> So, it should release as per tcp_fin_timeout, unless it's omitting something important or I'm reading it wrong?
I can read up on that.
> Though if this is an issue that seems to be made more severe by combining with libp2p, it might be sensible to discourage users from using Docker as a way to run Nimbus until we figure out what's happening for real. Another thing to do would be to try a Docker setup without our Ansible roles/cloud provider and see if it happens as well. If it does, then it's not just our fleet that shouldn't use Docker for the beacon node.
Well, if users do run into issues with this setup, we should hear about it pretty soon because they would theoretically stop attesting. The fact that we haven't, to me at least, indicates that this either hasn't caused any issues so far or it just doesn't happen. Either way, I think it's too soon to say that this is a widespread issue. Maybe we should alert users that this might manifest so they keep a closer eye? We can work on properly communicating this over all the appropriate channels.
> Well, if users do run into issues with this setup, we should hear about it pretty soon because they would theoretically stop attesting.
Well, it depends how fast it grows. And since releases have been quite frequent recently, it's likely that people have been restarting often enough to not have any actual issues because of it. But yes, it's too early to tell.
OK, it sounds sensible to communicate it and see if we hear any users confirming this. This needs to be carefully worded though, to avoid being inundated with false positives.
I've deployed a Pyrmont node on my personal cloud host using a modified version of the Docker Compose file we use on our Pyrmont fleet, except the host runs a different OS (NixOS) and a different cloud setup. I'll monitor it to see if it shows the same symptoms.
Oh no, the official Nimbus Docker image doesn't use -d:insecure, so I don't get metrics. That's a pain. But I guess it makes sense why it wouldn't.
Hmm, I think we recommend running metrics on by default. Not sure why the docker image would be different?
@stefantalpalaru is there any reason why we aren't running our docker images with metrics on?
Also, it would be nice if the HTTP server for metrics returned some kind of hint as to why there are no metrics.
Right now all I get is:
> curl -sv localhost:9300/metrics
* Trying 127.0.0.1:9300...
* Connected to localhost (127.0.0.1) port 9300 (#0)
> GET /metrics HTTP/1.1
> Host: localhost:9300
> User-Agent: curl/7.74.0
> Accept: */*
>
* Empty reply from server
* Connection #0 to host localhost left intact
A response saying "you need to compile with -d:insecure to get metrics" would probably make some users less confused.
> @stefantalpalaru is there any reason why we aren't running our docker images with metrics on?
The HTTP server used for metrics is considered insecure, and we don't want to ship insecure software.
> Also, it would be nice if the HTTP server for metrics returned some kind of hint as to why there are no metrics.
No can do. If metrics aren't enabled, that HTTP server is not running. The best we can do is show an error on the "--metrics" param.
Right. Truth be told, if a flag is ignored, an error and a non-0 exit code would make more sense to me. But to each their own.
> No can do. If metrics aren't enabled, that HTTP server is not running. The best we can do is show an error on the "--metrics" param.
It should probably exit with a non-0 code if it's not compiled with -d:insecure and --metrics was passed.
So far I'm not seeing any stuck connections on my personal host:
> sudo ss -H -4 state FIN-WAIT-2
Which suggests that this is indeed a combination of multiple factors that makes this happen.
> So far I'm not seeing any stuck connections on my personal host:
I assume you're running it with docker?
Yes, as I said in https://github.com/status-im/infra-nimbus/issues/35#issuecomment-775342047.
@jlokier made a good point on Discord about using the NAT port mapping for Docker instead of docker-proxy. I'll look into that next.
I've reverted the unstable-libp2p-small-01.aws-eu-central-1a.nimbus.pyrmont host to using Docker for now.
> I've reverted the unstable-libp2p-small-01.aws-eu-central-1a.nimbus.pyrmont host to using Docker for now.
Have you disabled NAT? Running while true; do sleep 5; ss -H -4 state FIN-WAIT-2 | wc -l; done shows a steady increase of FIN-WAIT-2 connections, sadly.
No, I just undid the temporary systemd setup. And NAT is what I want to try using, instead of docker-proxy. But I'm busy today dealing with the Consul update.
> No, I just undid the temporary systemd setup. And NAT is what I want to try using, instead of docker-proxy. But I'm busy today dealing with the Consul update.
Got you! No worries, just trying to follow along...
It appears that the userland-proxy setting is enabled by default because of the need for backwards compatibility with RHEL6. From what I've read, disabling it on our hosts, which run Ubuntu 20.04.1, should not be an issue. But I will have to test this thoroughly before I try to roll it out widely.
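For reference, the change boils down to a single daemon option. A minimal /etc/docker/daemon.json would look like this, assuming no other options are set there (merge with any existing keys otherwise, and restart dockerd afterwards for it to take effect):

```json
{
  "userland-proxy": false
}
```

With this set, published ports are handled via iptables NAT rules instead of per-port docker-proxy processes.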
I've tested this a bit on node-01.do-ams3.eth.test and it seems to work well. I saw no issues.
I've manually changed /etc/docker/daemon.json on unstable-libp2p-small-01.aws-eu-central-1a.nimbus.pyrmont and now the ports exposed by the container appear as dockerd instead of docker-proxy:
admin@unstable-libp2p-small-01.aws-eu-central-1a.nimbus.pyrmont:~ % sudo netstat -lpnt | grep dockerd
tcp 0 0 0.0.0.0:9100 0.0.0.0:* LISTEN 567/dockerd
tcp 0 0 0.0.0.0:9300 0.0.0.0:* LISTEN 567/dockerd
tcp 0 0 127.0.0.1:11000 0.0.0.0:* LISTEN 567/dockerd
As compared to another host which has one per mapped port:
admin@unstable-large-01.aws-eu-central-1a.nimbus.pyrmont:~ % sudo ps x | grep docker-proxy
338029 ? Sl 0:00 /usr/bin/docker-proxy -proto tcp -host-ip 127.0.0.1 -host-port 11000 -container-ip 172.17.2.2 -container-port 11000
338043 ? Sl 0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9300 -container-ip 172.17.2.2 -container-port 9300
338056 ? Sl 0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9100 -container-ip 172.17.2.2 -container-port 9100
338068 ? Sl 0:00 /usr/bin/docker-proxy -proto udp -host-ip 0.0.0.0 -host-port 9100 -container-ip 172.17.2.2 -container-port 9100
The port appears correctly open:
> sudo nmap -Pn -p9100 unstable-libp2p-small-01.aws-eu-central-1a.nimbus.pyrmont
Starting Nmap 7.80 ( https://nmap.org ) at 2021-02-11 15:54 CET
Nmap scan report for unstable-libp2p-small-01.aws-eu-central-1a.nimbus.pyrmont (18.195.225.101)
Host is up (0.025s latency).
rDNS record for 18.195.225.101: ec2-18-195-225-101.eu-central-1.compute.amazonaws.com
PORT STATE SERVICE
9100/tcp open jetdirect
Nmap done: 1 IP address (1 host up) scanned in 0.09 seconds
Though I do wish I had something like node-canary from status-go, but for nimbus-eth2, to do more thorough tests.
Peers recovered quickly:
I'm gonna deploy this setup to a few more hosts and leave it be for a day.
I've deployed it to two more hosts:
testing-small-04.aws-eu-central-1a.nimbus.pyrmont
unstable-small-04.aws-eu-central-1a.nimbus.pyrmont
I'll check them tomorrow. If everything is fine I'll deploy the userland-proxy: false setting to all Nimbus hosts.
Since based on what I read it generally improves performance I'm also going to deploy it to the rest of our infra.
Oh, and so far the libp2p host has no FIN-WAIT-2 connections:
admin@unstable-libp2p-small-01.aws-eu-central-1a.nimbus.pyrmont:~ % ss -H -4 state FIN-WAIT-2 | wc -l
0
> Peers recovered quickly:
> I'm gonna deploy this setup to a few more hosts and leave it be for a day.
Interesting how the graph changed from a smooth line to a more unstable, bumpy one 🤔
Just look at the previous restart. It's normal:
I'm not seeing anything wrong with the nodes in the graphs. I'll deploy this setup to all Pyrmont nodes.
It seems to work fine, with one exception: I keep seeing alerts about the Nimbus libp2p port timing out randomly and then recovering. I don't get why this is happening. I will revert the change for the weekend so I don't get pinged all the time.
Can we disable the alert there for the time being and instead see if there is any degradation due to the timing out?
I guess.
Done.
We're seeing a lot of connections stuck in the FIN_WAIT2 state on Pyrmont fleet hosts.