Closed jefflill closed 5 years ago
I have confirmed that the neon-proxy based containers do not inherit the host machine's `net.ipv4.ip_local_port_range` setting. We'll need to try setting this in the Dockerfile by modifying `/etc/sysctl.conf` (or perhaps `/etc/sysctl.d/00-alpine.conf`).
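For reference, the host-level setting we'd be trying to mirror is a single line in the sysctl configuration (the range shown is the one neonHIVE currently uses; as noted below, this approach doesn't actually take effect inside Docker containers):

```
# /etc/sysctl.conf (or a drop-in file under /etc/sysctl.d/)
net.ipv4.ip_local_port_range = 9000 65535
```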
EDIT: You can set kernel parameters for containers using `docker run --sysctl`, and there's a way to do this in a Docker stack, but there is no implementation for plain services. Here are the tracking issues:
https://github.com/moby/moby/issues/25303 <-- EPIC
https://github.com/moby/moby/issues/25209 <-- REQUEST
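For completeness, here's what the per-container sysctl looks like with `docker run` and in a compose/stack file (the image name is a placeholder; the range is the widened one discussed below):

```
# Per-container kernel parameter via docker run:
docker run --sysctl net.ipv4.ip_local_port_range="1024 65535" my-proxy-image

# Equivalent in a compose/stack file:
services:
  proxy:
    image: my-proxy-image
    sysctls:
      net.ipv4.ip_local_port_range: "1024 65535"
```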
The current neonHIVE configuration can run into Linux SNAT/DNAT port exhaustion issues when scaling network traffic to medium or high loads. This problem can surface due to the Docker ingress/mesh network DNAT iptables rules, but it can also happen in other places, like the pfSense DMZ load balancer rules that direct external traffic to cluster nodes.
There appear to be two somewhat related problems:
At high load, traffic being proxied by a load balancer or transformed by DNAT will have the same source IP, so only the source port can be varied when establishing a connection to the backend server. When the backend connection is closed, the source port goes into TIME_WAIT for 2 minutes (on Linux) and cannot be reused during this time. neonHIVE currently configures the kernel to allocate ephemeral ports in the range 9000-65535 (56,536 ports), so assuming each backend connection is closed immediately and its source port goes into TIME_WAIT, the maximum connection rate is 56536/120 ≈ 471/sec per hive host.
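The arithmetic above can be sketched as a quick back-of-the-envelope calculation, using the 2-minute TIME_WAIT figure from the text:

```python
# Worst-case new-connection rate for a single source IP, assuming every
# backend connection closes immediately and its port sits in TIME_WAIT.

TIME_WAIT_SECS = 120  # 2-minute TIME_WAIT duration cited above

def max_conns_per_sec(port_min: int, port_max: int) -> float:
    """Upper bound on connections/sec given an ephemeral port range."""
    ephemeral_ports = port_max - port_min + 1
    return ephemeral_ports / TIME_WAIT_SECS

# Current neonHIVE range:
print(round(max_conns_per_sec(9000, 65535)))   # ~471 conns/sec per host
# Widened 1024-65535 range discussed below:
print(round(max_conns_per_sec(1024, 65535)))   # ~538 conns/sec per host
```

Widening the range helps, but only linearly; keeping connections alive (below) attacks the problem directly.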
There also appears to be a Linux kernel race condition that can cause two inbound connections to be assigned the same DNAT source port, resulting in SYN packets being dropped and then re-transmission delays. This is discussed in detail here. Note that this is not a Docker-specific issue; it happens in Kubernetes too.
There are some possible mitigations:
[ ] Have neon-proxy-manager configure backend HTTP connections to remain alive wherever possible. I believe this is the default, but I should verify that I'm not disabling it (perhaps making this a load balancer rule option). This should go a long way toward preventing source port exhaustion in the neon-proxy-public and neon-proxy-private containers.
[x] I am not currently configuring the source port range in the public or private proxy containers (I assumed this would be picked up from the hive host; I now doubt that's actually true). In any case, I should modify the neon-proxy container to use `sysctl` to set `net.ipv4.ip_local_port_range = 1024 65535`, which is probably the largest possible range. NOTE: I tried setting `net.ipv4.ip_local_port_range` in the HAProxy Dockerfiles and also live within a running container. This doesn't work for Docker containers by design, since the namespaced container network stack is managed by the Docker engine. It looks like it's possible to pass `--sysctl` options to `docker run ...`, but this option is not available for services. So it looks like about 32K ports is all we can get.
[ ] Have neon-proxy-manager keep inbound HTTP connections alive too. I believe I'm currently closing connections, which will result in possible port exhaustion at the pfSense load balancer as well as potentially poor latency due to having to establish new connections.
[ ] It appears that it will be possible (in the future) to mitigate issue 2 above by having Docker specify the `NF_NAT_RANGE_PROTO_RANDOM_FULLY` flag when generating its `DOCKER-INGRESS` DNAT rules. The latest version of iptables supports the `--random-fully` option, but that version isn't available on the current hosts and Docker isn't currently generating this option anyway. One possible hack, if this becomes unbearable, might be to munge the iptables DNAT module so that it always sets this flag and deploy this to the hive hosts and perhaps even the pfSense boxes.
[ ] We can also look into having HAProxy route traffic to each backend via multiple network interfaces. neon-proxy-manager could generate these automatically, but I wonder if it's possible to have Docker assign more than one interface to a container. We might also do the same thing with pfSense by assigning multiple network interfaces and munging the HAProxy backends (perhaps by hand).
[ ] If we're not able to assign multiple IPs to an HAProxy container, we can also simply deploy more of these containers to accomplish the same thing (at the cost of additional backend health checks).
[ ] There's something called iproute2 which looks like it can be used to mitigate port starvation. I don't understand this yet, but it appears that you can assign additional IP addresses to an interface. I wonder if this would work in a container.
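As a sketch of the keep-alive and multiple-source-IP mitigations above, an HAProxy backend might look something like this (all names and addresses are hypothetical; neon-proxy-manager would generate the real configuration):

```
backend web-backend
    balance roundrobin
    # Reuse backend connections instead of closing them, so source
    # ports don't churn through TIME_WAIT on every request:
    option http-keep-alive
    timeout http-keep-alive 30s
    # Spread outbound connections across multiple source IPs to
    # multiply the available ephemeral port space (hypothetical IPs):
    server web0  10.0.1.10:80 check source 10.0.0.2
    server web0b 10.0.1.10:80 check source 10.0.0.3
```

Each additional source IP adds another full ephemeral port range against the same backend, which is the same idea as the multiple-interface and iproute2 items above.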
Here are the links I found while researching this:
https://stackoverflow.com/questions/10085705/load-balancer-scalability-and-max-tcp-ports
https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02
https://github.com/tsenart/vegeta (vegeta load generator)
https://github.com/moby/moby/issues/35082
http://archive.linuxvirtualserver.org/html/lvs-devel/2015-10/msg00067.html
https://medium.freecodecamp.org/how-we-fine-tuned-haproxy-to-achieve-2-000-000-concurrent-ssl-connections-d017e61a4d27
https://www.linangran.com/?p=547
The first two links really describe the problem. The third link is to the vegeta load generator project that looks like it's better than the Apache load generator we've been using.