status-im / infra-nimbus

Infrastructure for Nimbus cluster
https://nimbus.team

Investigate large volume of connections in FIN_WAIT2 state #35

Closed: jakubgs closed this issue 3 years ago

jakubgs commented 3 years ago

We're seeing a lot of connections stuck in FIN_WAIT2 state on Pyrmont fleet hosts.
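
For reference, this is how I'm spotting and counting them on the hosts (plain ss from iproute2):

# Count IPv4 connections currently stuck in FIN-WAIT-2
ss -H -4 state FIN-WAIT-2 | wc -l
# List them with local and peer addresses to see which service they belong to
ss -H -4 state FIN-WAIT-2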

dryajov commented 3 years ago

If you look at https://github.com/docker/for-linux/issues/335, there isn't any explanation either, and yet it happens across many setups. Figuring out what's causing it at this point would require some heavy-duty digging into the OS and Docker TCP stacks - which I don't think we have the time or the resources for ATM.

This is also something that hasn't been widely reported by any other user, and we know for a fact that some do run Docker, so quite possibly this is a combination of our cloud provider, some specific middleware, the OS, Docker and then libp2p - which one it is beats me, and like I said, I don't think we have the resources, nor is it worth spending the time tracking it down at this time.

I suggest we look for acceptable workarounds for now and address this properly once it has been confirmed to be a wider issue (or, as I suspect, Docker fixes it in a future release - but so far they haven't acknowledged it as an issue, IIUC).

So let's look at our options:

jakubgs commented 3 years ago

This is also something that hasn't been widely reported by any other user, and we know for a fact that some do run Docker, so quite possibly this is a combination of our cloud provider, some specific middleware, the OS, Docker and then libp2p

Yes, I also think that's possible. Which is why this is such a pain. It's most probably a combination of a few things.

dryajov commented 3 years ago

Yeah, it's a pesky issue for sure and I think right now it comes down to mitigating the cost by finding a reasonable workaround rather than trying to pinpoint what it is exactly - which can be a time sink of epic proportions.

jakubgs commented 3 years ago

I agree. Though my question would be: what's the harm? I mean, are we just worried that this will cause some issues in the future? Because as far as I know there is nothing wrong with these old connections lingering, unless there's a symptom that I'm not aware of.

jakubgs commented 3 years ago

The simplest solution is indeed just running it directly via systemd. But that leaves the issue of distributing new builds; we'll need a new setup for doing that.
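
As a rough sketch, the unit would look something like the following; the binary path, user and flags here are placeholders rather than our actual setup:

[Unit]
Description=Nimbus beacon node (Pyrmont)
After=network-online.target

[Service]
# Placeholder path, user and flags - the real unit would point at whatever build we distribute
ExecStart=/usr/local/bin/nimbus_beacon_node --network=pyrmont --data-dir=/var/lib/nimbus
User=nimbus
Restart=on-failure

[Install]
WantedBy=multi-user.target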

jakubgs commented 3 years ago

But my question still stands: why is this an issue, other than that it's just not pretty? Is this causing any actual degradation in the functionality of the node?

dryajov commented 3 years ago

I agree. Though my question would be: what's the harm? I mean, are we just worried that this will cause some issues in the future? Because as far as I know there is nothing wrong with these old connections lingering, unless there's a symptom that I'm not aware of.

Oh, it's definitely a resource leak. This port pair isn't usable anymore so the OS can run out of port pairs at some point.

For example, when a remote establishes a connection to port 9001, the OS will track it as a port pair of the form xxxx:9001. If this connection never closes properly, the OS won't release that port pair, and it won't be reusable until the socket is destroyed, which means that new remotes won't be able to connect anymore once the OS runs out of port pairs.
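
To get a rough sense of the headroom, you can compare the ephemeral port range against the number of stuck sockets (these commands should work as-is on our Ubuntu hosts):

# Ephemeral port range the kernel can hand out (the Linux default is roughly 32768-60999)
cat /proc/sys/net/ipv4/ip_local_port_range
# How many sockets are currently tied up in FIN-WAIT-2
ss -H -4 state FIN-WAIT-2 | wc -l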

dryajov commented 3 years ago

The one thing that puzzles me tho, is why aren't these connections killed off as per net.ipv4.tcp_fin_timeout - or do they actually time out?
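
For reference, the value can be inspected and tuned on the fly with sysctl (not persisted across reboots unless added to sysctl.conf):

# Current FIN-WAIT-2 timeout in seconds
sysctl net.ipv4.tcp_fin_timeout
# Temporarily lower it, e.g. to 30 seconds
sudo sysctl -w net.ipv4.tcp_fin_timeout=30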

jakubgs commented 3 years ago

Well, that's the thing. I think they do time out, but new ones just keep showing up. I'd have to verify that.

jakubgs commented 3 years ago

If it does not respect the timeout then yes, theoretically the OS could run out of port pairs. But if it does then it's not really an issue.

dryajov commented 3 years ago

If it does not respect the timeout then yes, theoretically the OS could run out of port pairs. But if it does then it's not really an issue.

Yeah, exactly this - otherwise I'm fine if we just leave it be for the time being.

jakubgs commented 3 years ago

Let me verify that then.

jakubgs commented 3 years ago

Though if this is an issue that seems to be made more severe by combining with libp2p it might be sensible to discourage users from using Docker as a way to run Nimbus until we figure out what's happening for real. Another thing to do would be to try to do a Docker setup without using our Ansible roles/Cloud provider and see if it happens as well. If it does then it's not just our fleet that shouldn't use Docker for beacon node.

jakubgs commented 3 years ago

Okay, so I verified that the host has the timeout set to 60:

admin@testing-large-01.aws-eu-central-1a.nimbus.pyrmont:~ % cat /proc/sys/net/ipv4/tcp_fin_timeout
60

And I picked a specific IP and port that was one of the connections stuck on FIN-WAIT-2 and watched it for over 60 seconds:

admin@testing-large-01.aws-eu-central-1a.nimbus.pyrmont:~ % while true; do sleep 5; ss -H -4 state FIN-WAIT-2 | grep 172.17.2.1:60964; done
tcp     0          0              172.17.2.1:60964          172.17.2.2:9100     
tcp     0          0              172.17.2.1:60964          172.17.2.2:9100     
tcp     0          0              172.17.2.1:60964          172.17.2.2:9100  
... (more than 60 seconds passed) ...
tcp     0          0              172.17.2.1:60964          172.17.2.2:9100     
tcp     0          0              172.17.2.1:60964          172.17.2.2:9100     
tcp     0          0              172.17.2.1:60964          172.17.2.2:9100 

As far as I can tell it ignores the OS timeout for FIN-WAIT state. Pretty crazy.

jakubgs commented 3 years ago

Maybe it ignores the timeout because the socket is still in use by the program.

jakubgs commented 3 years ago

And restarting the container does indeed release the FIN-WAIT-2 connections. So it must be because the sockets are kept open by the program.
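
That would match the kernel docs, which say tcp_fin_timeout only applies to orphaned sockets, i.e. ones no longer referenced by any application. A quick way to check which process, if any, is still holding a FIN-WAIT-2 socket:

# Show FIN-WAIT-2 sockets along with the owning process, if any
sudo ss -tnp state fin-wait-2
# Orphaned sockets (the only ones the timeout applies to) show no process info at the end of the line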

dryajov commented 3 years ago

Hmm, maybe this is affected by the tcp_fin_timeout set in the container as well? This is indeed puzzling...

From https://benohead.com/blog/2013/07/21/tcp-about-fin_wait_2-time_wait-and-close_wait/

FIN_WAIT_2
If many sockets which were connected to a specific remote application end up stuck in this state, it usually indicates that the remote application either always dies unexpectedly when in the CLOSE_WAIT state or just fails to perform an active close after the passive close.

The timeout for sockets in the FIN-WAIT-2 state is defined with the parameter tcp_fin_timeout. You should set it to a value high enough so that if the remote end-point is going to perform an active close, it will have time to do it. On the other hand, sockets in this state do use some memory (even though not much) and this could lead to a memory overflow if too many sockets are stuck in this state for too long.

So, it should release as per tcp_fin_timeout, unless it's omitting something important or I'm reading it wrong?

jakubgs commented 3 years ago

The container should inherit it from the host, and it does:

admin@testing-large-01.aws-eu-central-1a.nimbus.pyrmont:~ % d exec -it beacon-node-pyrmont-testing-large bash
root@4b61aa05fcc2:/# cat /proc/sys/net/ipv4/tcp_fin_timeout
60

So, it should release as per tcp_fin_timeout, unless it's omitting something important or I'm reading it wrong?

I can read up on that.

dryajov commented 3 years ago

Though if this is an issue that seems to be made more severe by combining with libp2p it might be sensible to discourage users from using Docker as a way to run Nimbus until we figure out what's happening for real. Another thing to do would be to try to do a Docker setup without using our Ansible roles/Cloud provider and see if it happens as well. If it does then it's not just our fleet that shouldn't use Docker for beacon node.

Well, if users do run into issues with this setup, we should hear about it pretty soon because they would theoretically stop attesting. The fact that we haven't, to me at least, indicates that this either hasn't caused any issues so far or it just doesn't happen. Either way, I think it's too soon to say that this is a widespread issue. Maybe we should alert users that this might manifest so they keep a closer eye? We can work on properly communicating this over all the appropriate channels.

jakubgs commented 3 years ago

Well, if users do run into issues with this setup, we should hear about it pretty soon because they would theoretically stop attesting.

Well, it depends how fast it grows. And since releases have recently been quite frequent, it's likely that people have been restarting often enough not to have any actual issues because of it. But yes, it's too early to tell.

dryajov commented 3 years ago

OK, it sounds sensible to communicate it and see if we hear any users confirming this. This needs to be carefully worded tho, so as to avoid being inundated with false positives.

jakubgs commented 3 years ago

I've deployed a Pyrmont node on my personal cloud host using a modified Docker Compose file like the one we use on our Pyrmont fleet.

Except the host runs a different OS (NixOS) and sits on a different cloud setup. I'll monitor it to see if it shows the same symptoms.
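
For reference, this is roughly the shape of the service definition I used; the image tag is a placeholder and the ports mirror what the fleet maps (9100 libp2p TCP/UDP, 9300 metrics, 11000 RPC on loopback):

version: "3.7"
services:
  beacon-node:
    # Placeholder image - not the exact one used on the fleet
    image: nimbus-eth2:local
    restart: always
    ports:
      - "9100:9100/tcp"              # libp2p
      - "9100:9100/udp"              # discovery
      - "9300:9300/tcp"              # metrics
      - "127.0.0.1:11000:11000/tcp"  # RPC, loopback only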

jakubgs commented 3 years ago

Oh no, the official Nimbus Docker image isn't built with -d:insecure, so I don't get metrics. That's a pain, but I guess it makes sense why it wouldn't be.
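
For anyone following along: metrics require a build with the insecure define, which, assuming the usual nimbus-eth2 Makefile conventions, looks roughly like this:

# Build the beacon node with the metrics HTTP server compiled in
make NIMFLAGS="-d:insecure" nimbus_beacon_node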

dryajov commented 3 years ago

Hmm, I think we recommend running with metrics on by default. Not sure why the Docker image would be different?

dryajov commented 3 years ago

@stefantalpalaru is there any reason why we aren't running our docker images with metrics on?

jakubgs commented 3 years ago

Also, it would be nice if the HTTP server for metrics returned some kind of hint as to why there are no metrics.

Right now all I get is:

 > curl -sv localhost:9300/metrics
*   Trying 127.0.0.1:9300...
* Connected to localhost (127.0.0.1) port 9300 (#0)
> GET /metrics HTTP/1.1
> Host: localhost:9300
> User-Agent: curl/7.74.0
> Accept: */*
> 
* Empty reply from server
* Connection #0 to host localhost left intact

A response saying "you need to compile with -d:insecure to get metrics" would probably make some users less confused.

stefantalpalaru commented 3 years ago

@stefantalpalaru is there any reason why we aren't running our docker images with metrics on?

The HTTP server used for metrics is considered insecure, and we don't want to ship insecure software.

Also, it would be nice if the HTTP server for metrics returned some kind of hint as to why there are no metrics.

No can do. If metrics aren't enabled, that HTTP server is not running. The best we can do is show an error on the "--metrics" param.

jakubgs commented 3 years ago

Right. Truth be told, if a flag is ignored, an error and a non-zero exit code would make more sense to me. But to each their own.

dryajov commented 3 years ago

No can do. If metrics aren't enabled, that HTTP server is not running. The best we can do is show an error on the "--metrics" param.

It should probably exit with a non-zero code if it's not compiled with -d:insecure and --metrics was passed.

jakubgs commented 3 years ago

So far I'm not seeing any stuck connections on my personal host:

 > sudo ss -H -4 state FIN-WAIT-2

Which suggests that this is indeed a combination of multiple factors that makes this happen.

dryajov commented 3 years ago

So far I'm not seeing any stuck connections on my personal host:

I assume you're running it with docker?

jakubgs commented 3 years ago

Yes, as I said in https://github.com/status-im/infra-nimbus/issues/35#issuecomment-775342047.

jakubgs commented 3 years ago

@jlokier made a good point on Discord about using the NAT port mapping for Docker instead of docker-proxy.

I'll look into that next.

jakubgs commented 3 years ago

I've reverted the unstable-libp2p-small-01.aws-eu-central-1a.nimbus.pyrmont host to using Docker for now.

dryajov commented 3 years ago

I've reverted the unstable-libp2p-small-01.aws-eu-central-1a.nimbus.pyrmont host to using Docker for now.

Have you disabled NAT? Running while true; do sleep 5; ss -H -4 state FIN-WAIT-2 | wc -l; done shows a steady increase of FIN-WAIT-2 connections, sadly.

jakubgs commented 3 years ago

No, I just undid the temporary systemd setup. And NAT is what I want to try using, instead of docker-proxy.

But I'm busy today dealing with the Consul update.

dryajov commented 3 years ago

No, I just undid the temporary systemd setup. And NAT is what I want to try using, instead of docker-proxy.

But I'm busy today dealing with the Consul update.

Got you! No worries, just trying to follow along...

jakubgs commented 3 years ago

It appears that the userland-proxy setting is enabled by default because of the need for backwards compatibility with RHEL 6.

From what I've read, disabling it on our hosts, which run Ubuntu 20.04.1, should not be an issue. But I will have to test this thoroughly before I try to roll it out widely.
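
The change itself is a single key in /etc/docker/daemon.json:

{
  "userland-proxy": false
}

followed by a daemon restart, which also restarts running containers unless live-restore is enabled:

sudo systemctl restart docker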

jakubgs commented 3 years ago

I've tested this a bit on node-01.do-ams3.eth.test and it seems to work well. I saw no issues.

I've manually changed /etc/docker/daemon.json on unstable-libp2p-small-01.aws-eu-central-1a.nimbus.pyrmont and now the ports exposed by the container show up under dockerd instead of docker-proxy:

admin@unstable-libp2p-small-01.aws-eu-central-1a.nimbus.pyrmont:~ % sudo netstat -lpnt | grep dockerd
tcp        0      0 0.0.0.0:9100            0.0.0.0:*               LISTEN      567/dockerd         
tcp        0      0 0.0.0.0:9300            0.0.0.0:*               LISTEN      567/dockerd         
tcp        0      0 127.0.0.1:11000         0.0.0.0:*               LISTEN      567/dockerd 

Compared to another host, which has one docker-proxy process per mapped port:

admin@unstable-large-01.aws-eu-central-1a.nimbus.pyrmont:~ % sudo ps x | grep docker-proxy 
 338029 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 127.0.0.1 -host-port 11000 -container-ip 172.17.2.2 -container-port 11000
 338043 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9300 -container-ip 172.17.2.2 -container-port 9300
 338056 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9100 -container-ip 172.17.2.2 -container-port 9100
 338068 ?        Sl     0:00 /usr/bin/docker-proxy -proto udp -host-ip 0.0.0.0 -host-port 9100 -container-ip 172.17.2.2 -container-port 9100

jakubgs commented 3 years ago

The port appears correctly open:

 > sudo nmap -Pn -p9100 unstable-libp2p-small-01.aws-eu-central-1a.nimbus.pyrmont     
Starting Nmap 7.80 ( https://nmap.org ) at 2021-02-11 15:54 CET
Nmap scan report for unstable-libp2p-small-01.aws-eu-central-1a.nimbus.pyrmont (18.195.225.101)
Host is up (0.025s latency).
rDNS record for 18.195.225.101: ec2-18-195-225-101.eu-central-1.compute.amazonaws.com

PORT     STATE SERVICE
9100/tcp open  jetdirect

Nmap done: 1 IP address (1 host up) scanned in 0.09 seconds

Though I do wish I had something like node-canary from status-go, but for nimbus-eth2, to do more thorough tests.

jakubgs commented 3 years ago

Peers recovered quickly:

image

I'm gonna deploy this setup to a few more hosts and leave it be for a day.

jakubgs commented 3 years ago

I've deployed it to two more hosts:

testing-small-04.aws-eu-central-1a.nimbus.pyrmont
unstable-small-04.aws-eu-central-1a.nimbus.pyrmont

I'll check them tomorrow. If everything is fine I'll deploy userland-proxy: false to all Nimbus hosts.

Since, based on what I've read, it generally improves performance, I'm also going to deploy it to the rest of our infra.
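
After the rollout, a quick per-host sanity check is to confirm that no docker-proxy processes are left and that dockerd itself owns the mapped ports:

# Should print nothing once the userland proxy is disabled
pgrep -a docker-proxy
# Mapped ports should now be listed under dockerd
sudo netstat -lpnt | grep dockerd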

jakubgs commented 3 years ago

Oh, and so far the libp2p host has no FIN-WAIT-2 connections:

admin@unstable-libp2p-small-01.aws-eu-central-1a.nimbus.pyrmont:~ % ss -H -4 state FIN-WAIT-2 | wc -l
0

dryajov commented 3 years ago

Peers recovered quickly:

image

I'm gonna deploy this setup to a few more hosts and leave it be for a day.

Interesting how the graph changed from a smooth line to a more unstable, bumpy one 🤔

jakubgs commented 3 years ago

Just look at the previous restart. It's normal:

image

jakubgs commented 3 years ago

I'm not seeing anything wrong with the nodes in the graphs. I'll deploy this setup to all Pyrmont nodes.

jakubgs commented 3 years ago

It seems to work fine, with one exception. I keep seeing alerts from the Nimbus libp2p port timing out randomly and then recovering.

image

I don't get why this is happening. I will revert the change for the weekend to not get pinged all the time.

dryajov commented 3 years ago

Can we disable the alert there for the time being and instead see if there is any degradation due to the timing out?

jakubgs commented 3 years ago

I guess.

jakubgs commented 3 years ago

Done.