status-im / infra-nimbus

Infrastructure for Nimbus cluster
https://nimbus.team

Investigate large volume of connections in FIN_WAIT2 state #35

Closed: jakubgs closed this issue 3 years ago

jakubgs commented 3 years ago

We're seeing a lot of connections stuck in FIN_WAIT2 state on Pyrmont fleet hosts.

stefantalpalaru commented 3 years ago

> I don't get why this is happening.

Maybe because the event loop is busy doing something else instead of networking. Don't forget we're using a single thread for everything in there.

Since you no longer have a proxy in the middle responding to that ping, you're gazing directly into our event loop.

jakubgs commented 3 years ago

Yeah, that's a sensible explanation. I'll have to look into it. I did catch a glimpse of it using nc -zv localhost 9100, where I got no response for a while, and then it went straight back to responding just fine. So you might be right about that.

If that's indeed the issue, then in theory increasing the timeout on the TCP healthcheck should stop it from triggering when that happens.

dryajov commented 3 years ago

> Maybe because the event loop is busy doing something else instead of networking. Don't forget we're using a single thread for everything in there.

I wonder how these alerts are being generated and what exactly is timing out? I don't doubt that we have jerkiness in the event loop, but if it were this bad, I'd expect it to break the client in several ways. So far I haven't seen that, and it's been pretty stable overall.

jakubgs commented 3 years ago

It's a simple TCP 3-way handshake check: it establishes a connection and terminates it right away. The default timeout is 2 seconds:

https://github.com/status-im/infra-role-consul-service/blob/b1d5ad5caa7d7a036fd175292fa497175bb7c54c/templates/service.json.j2#L59

So it's possible. I've increased it to 5 seconds for now.
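
For reference, here's a minimal sketch of what such a check looks like once rendered into Consul's service definition format (the service name, file path and port are placeholders, not the actual template output; only the "tcp", "interval" and "timeout" fields are Consul's real check format):

# Hypothetical rendered Consul service definition with a plain TCP check,
# reflecting the timeout bump to 5 seconds mentioned above.
cat > /etc/consul/service_beacon_node.json << 'EOF'
{
  "service": {
    "name": "beacon-node-libp2p",
    "port": 9100,
    "checks": [
      { "tcp": "localhost:9100", "interval": "10s", "timeout": "5s" }
    ]
  }
}
EOF
consul reload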

jakubgs commented 3 years ago

Btw, the counts of FIN-WAIT-2 state connections are very low, so I'd say the docker-proxy in combination with some other elements did indeed cause the stuck connections:

 > ansible nimbus.pyrmont -o -a 'ss -H -4 state FIN-WAIT-2 | wc -l' | sort -h
stable-large-01.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 0
stable-small-01.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 2
testing-large-01.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 2
testing-small-01.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 2
testing-small-02.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 2
testing-small-03.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 2
testing-small-04.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 2
unstable-large-01.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 2
unstable-large-02.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 2
unstable-libp2p-small-01.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 0
unstable-small-01.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 0
unstable-small-02.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 0
unstable-small-03.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 0
unstable-small-04.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 1

So it might be sensible to inform users through some channel that use of docker-proxy/Docker userland-proxy is discouraged.
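
For reference, disabling the userland proxy is a single Docker daemon setting; a minimal sketch follows (the "userland-proxy" key and /etc/docker/daemon.json path are stock Docker; the rest assumes a systemd host and no pre-existing daemon.json to merge with):

# Disable the docker-proxy userland proxy for published ports.
# Note: this overwrites daemon.json; merge with existing settings if any.
cat > /etc/docker/daemon.json << 'EOF'
{
  "userland-proxy": false
}
EOF
systemctl restart docker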

dryajov commented 3 years ago

> The default timeout is 2 seconds:

OK, 2 seconds should be tolerable, so quite possibly it is the event loop.

> So it might be sensible to inform users through some channel that use of docker-proxy/Docker userland-proxy is discouraged.

For sure, we might want @sachayves to announce this over the appropriate channels.

Also, it would be good to update the docker ticket with a hint that the culprit might be the proxy.

Great job tracking this down @jakubgs!

jakubgs commented 3 years ago

This actually seems to last much longer than a few seconds:

admin@unstable-large-01.aws-eu-central-1a.nimbus.pyrmont:~ % nc -w 10 -zv localhost 9300
Connection to localhost 9300 port [tcp/*] succeeded!
admin@unstable-large-01.aws-eu-central-1a.nimbus.pyrmont:~ % nc -w 10 -zv localhost 9100
nc: connect to localhost port 9100 (tcp) timed out: Operation now in progress
admin@unstable-large-01.aws-eu-central-1a.nimbus.pyrmont:~ % nc -w 10 -zv localhost 9300
Connection to localhost 9300 port [tcp/*] succeeded!
admin@unstable-large-01.aws-eu-central-1a.nimbus.pyrmont:~ % nc -w 10 -zv localhost 9100
nc: connect to localhost port 9100 (tcp) timed out: Operation now in progress
admin@unstable-large-01.aws-eu-central-1a.nimbus.pyrmont:~ % nc -w 10 -zv localhost 9300
Connection to localhost 9300 port [tcp/*] succeeded!
admin@unstable-large-01.aws-eu-central-1a.nimbus.pyrmont:~ % nc -w 10 -zv localhost 9100
nc: connect to localhost port 9100 (tcp) timed out: Operation now in progress

As you can see, the metrics port is responding just fine, while the libp2p one is not.

jakubgs commented 3 years ago

And it recovered after a few minutes:

admin@unstable-large-01.aws-eu-central-1a.nimbus.pyrmont:~ % nc -w 10 -zv localhost 9100
Connection to localhost 9100 port [tcp/*] succeeded!

So no, increasing the timeout won't fix this. I'll disable the libp2p port TCP checks as before.

dryajov commented 3 years ago

Ah, I just realized why this is happening, and it's expected: libp2p has a limit on incoming connections, and once that limit is reached it won't accept any more, so new connection attempts get stuck in the TCP backlog until they time out.
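
That should also be directly observable on the host; a small sketch, assuming 9100 is the libp2p listen port as in the nc checks above:

# For LISTEN sockets, ss reports Recv-Q as the number of connections
# currently waiting in the accept queue and Send-Q as the backlog limit.
# Recv-Q sitting at Send-Q means new handshakes queue up until the client
# gives up, which is exactly what the timed-out nc checks look like.
ss -ltn 'sport = :9100'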

jakubgs commented 3 years ago

Ooooooh, interesting. But that means a more advanced canary would have the same problem. Okay, but at least we know what's happening. Nice.

dryajov commented 3 years ago

Definitely, it's going to be wonky for any sort of port monitoring.

stefantalpalaru commented 3 years ago

https://docs.docker.com/engine/reference/commandline/dockerd/ :

> --userland-proxy Use userland proxy for loopback traffic (default true)

If it's only for loopback traffic, it should not affect most setups, right?

jakubgs commented 3 years ago

You're correct:

> Secondly, even when Docker is able to forward packets using netfilter rules, there is one circumstance where it is not possible to apply netfilter rules. Unless told otherwise, when a container's port is forwarded to the Docker host, it will be forwarded to all of the host's interfaces, including its loopback interface. But the Linux kernel does not allow the routing of loopback traffic, and therefore it's not possible to apply netfilter NAT rules to packets originating from 127.0.0.0/8. Instead, netfilter sends packets through the filter table's INPUT chain to a local process listening on the designated port - the docker-proxy[1].

https://windsock.io/the-docker-proxy/

So one of the pieces of the puzzle is most probably the TCP healthcheck we do via the Consul agent.
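
A quick way to see which side is actually answering on the published ports (both commands are stock iproute2/iptables, nothing fleet-specific assumed):

# With the userland proxy enabled, a docker-proxy process owns the published
# port and answers loopback traffic itself:
sudo ss -ltnp | grep -i docker-proxy
# With userland-proxy disabled, forwarding happens purely in netfilter,
# visible as DNAT rules in the DOCKER chain of the nat table:
sudo iptables -t nat -L DOCKER -n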

kdeme commented 3 years ago

> Ah, I just realized why this is happening, and it's expected: libp2p has a limit on incoming connections, and once that limit is reached it won't accept any more, so new connection attempts get stuck in the TCP backlog until they time out.

Yeah, I was thinking of adding a more advanced canary as @jakubgs requested, but indeed considered this an issue for health-checking libp2p. In practice this means we can only check the discovery protocol, because of the way we handle incoming connections when max peers is reached.

jakubgs commented 3 years ago

What if there was a peer ID or IP whitelist that ignored the max peers limit? So if the node is in the list it always gets connected.

IP whitelist would be much nicer to use tho, especially if you could whitelist subnets, like our internal VPN.

jakubgs commented 3 years ago

Our status-go has the concept of trusted nodes: https://github.com/status-im/status-go/blob/61993fab47931ef03f991371e31ce33f96c3c0e1/params/config.go#L254-L255 which do not count toward the MaxPeers limit: https://github.com/status-im/status-go/blob/61993fab47931ef03f991371e31ce33f96c3c0e1/params/config.go#L372-L374

dryajov commented 3 years ago

> What if there was a peer ID or IP whitelist that ignored the max peers limit? So if the node is in the list it always gets connected.

> IP whitelist would be much nicer to use tho, especially if you could whitelist subnets, like our internal VPN.

We don't have any notion of trusted peers right now, and it would be non-trivial to add. One solution could be to add a health-check protocol: you would keep a connection open and exchange pings every now and then, so you're less likely to hit the limit. The problem is the connect/disconnect use case of the health check.

But the easiest thing to do right now is just hit the metrics endpoint instead.
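
Something along these lines, as a sketch; the check name, file path, metrics port and path are assumptions based on the ports used earlier in this thread, not the actual fleet config:

# Hypothetical standalone Consul check that probes the beacon node's metrics
# endpoint over HTTP instead of TCP-probing the libp2p port.
cat > /etc/consul/check_beacon_metrics.json << 'EOF'
{
  "check": {
    "name": "beacon-node-metrics",
    "http": "http://localhost:9300/metrics",
    "interval": "10s",
    "timeout": "5s"
  }
}
EOF
consul reload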

jakubgs commented 3 years ago

Seems fixed to me:

 > ansible all -o -a 'ss -H -4 state FIN-WAIT-2 | wc -l' | sort -h
goerli-01.aws-eu-central-1a.nimbus.geth | CHANGED | rc=0 | (stdout) 0
mainnet-01.aws-eu-central-1a.nimbus.geth | CHANGED | rc=0 | (stdout) 0
node-01.aws-eu-central-1a.dash.nimbus | CHANGED | rc=0 | (stdout) 0
node-01.aws-eu-central-1a.log-store.nimbus | CHANGED | rc=0 | (stdout) 0
node-02.aws-eu-central-1a.log-store.nimbus | CHANGED | rc=0 | (stdout) 0
node-03.aws-eu-central-1a.log-store.nimbus | CHANGED | rc=0 | (stdout) 0
stable-large-01.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 0
stable-small-01.aws-eu-central-1a.nimbus.mainnet | CHANGED | rc=0 | (stdout) 0
stable-small-01.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 0
stable-small-02.aws-eu-central-1a.nimbus.mainnet | CHANGED | rc=0 | (stdout) 0
testing-large-01.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 0
testing-small-01.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 0
testing-small-02.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 0
testing-small-03.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 0
testing-small-04.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 0
unstable-large-01.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 0
unstable-large-02.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 0
unstable-small-01.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 0
unstable-small-02.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 0
unstable-small-03.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 0
unstable-small-04.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 0
unstable-libp2p-small-01.aws-eu-central-1a.nimbus.pyrmont | CHANGED | rc=0 | (stdout) 0

I'm deploying the userland-proxy:false setting from https://github.com/status-im/infra-role-bootstrap/commit/79798b7a to other hosts in our fleets gradually. I think this can be closed.

I did mention this issue in today's Town Hall, but we could maybe include a note in the next release or something.

dryajov commented 3 years ago

It seems to work fine, with one exception: I keep seeing alerts from the Nimbus libp2p port timing out randomly and then recovering.

One more thing to note here: the userland proxy was potentially hiding connection errors to closed ports in the target app, so we probably weren't getting reliable alerts from other hosts/apps either...

jakubgs commented 3 years ago

Well, most other healthchecks are HTTP-based checks that actually expect a specific response. Only services with protocols that can't easily be checked for a response get a simple TCP healthcheck. So it's rare, but yes.
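
Roughly the difference between the two, expressed as manual checks (the ports follow the earlier examples in this thread and the metrics path is an assumption):

# Plain TCP check: only verifies that the 3-way handshake completes.
nc -w 5 -zv localhost 9100
# HTTP check: expects an actual, well-formed response from the service.
curl -sf http://localhost:9300/metrics | head -n 3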