github / glb-director

GitHub Load Balancer Director and supporting tooling.

conntrack lookup removal in ipt_GLBREDIRECT breaks with network namespaces #111

Open jstangroome opened 4 years ago

jstangroome commented 4 years ago

The change to ipt_GLBREDIRECT implemented in PR #67 and discussed in issue #50 breaks deployments where the listening socket is in a different network namespace to where the -j GLBREDIRECT iptables rule is installed.

The observed behaviour is that GUE-encapsulated TCP SYN packets are accepted but all subsequent GUE packets for the same TCP session are then forwarded to the next-hop specified in the GUE private data, instead of being accepted locally.

Taking current master (commit 5387908) and reverting just the PR #67 merge commit 5e1edd0, i.e. git revert -m1 5e1edd0, corrects the behaviour. The behaviour is also mitigated by configuring the GLB with only a single backend, since there is then no next-hop to forward to, but this is not very useful in practice.

My assumption is that the inet_lookup_established call only considers ESTABLISHED sockets in the host network namespace, and that the now-deleted conntrack lookup code is no longer there to discover the conntrack entries created when the connection was directed to another network namespace.
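To make the suspected failure mode concrete, here is a toy Python model of the decision as described in this issue. It is only a sketch of the assumption above, not the module's actual code: Packet, decide and the host_established set are hypothetical names, and the real lookup happens in the kernel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    flow: tuple     # (src_ip, src_port, dst_ip, dst_port) of the inner TCP packet
    syn: bool       # True for the initial TCP SYN
    next_hop: str   # next-hop from the GUE private data ("" if there is none)


def decide(pkt, host_established):
    """Return "accept" or "forward:<hop>" for one GUE-encapsulated TCP packet."""
    if pkt.syn:
        return "accept"                      # SYNs are accepted locally
    if pkt.flow in host_established:
        return "accept"                      # ESTABLISHED socket in the host namespace
    if pkt.next_hop:
        return "forward:" + pkt.next_hop     # assumed to belong to another backend
    return "accept"                          # single backend: nowhere to forward to


flow = ("198.51.100.7", 40000, "203.0.113.10", 443)
host_established = set()  # the socket lives in the Pod namespace, so the host sees nothing

print(decide(Packet(flow, True, "10.0.0.2"), host_established))   # accept
print(decide(Packet(flow, False, "10.0.0.2"), host_established))  # forward:10.0.0.2
print(decide(Packet(flow, False, ""), host_established))          # accept
```

The second call is the observed fault: the SYN was accepted, but every later packet of the same session is forwarded because no matching socket is visible in the host namespace.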

One example where this occurs is on a Kubernetes node with the ip fou tunnel and GLBREDIRECT iptables rule configured in the host network namespace, while an nginx-ingress controller Pod listens on TCP ports 80 and 443 inside the Pod's network namespace, and traffic is routed from the host to the Pod via DNAT iptables rules added by the Kubernetes CNI. I expect the same behaviour can be reproduced without Kubernetes, such as with a Docker container's network namespace, or even just with ip netns add, ip netns exec and appropriate NAT rules.

The problem was experienced on Ubuntu 18.04.5 with kernel 5.4.0-42-generic.

I have not confirmed this, but I suspect that configuring the fou tunnel and the GLBREDIRECT iptables rule inside the Pod network namespace would also resolve the fault. However, this is less maintainable in a Kubernetes ingress controller context.

Possible options to fix ipt_GLBREDIRECT:

theojulienne commented 4 years ago

Thanks for reporting this! It's certainly an interesting issue.

I think this is generally a new use case, one where iptables NAT is considered a "locally established connection" and it shouldn't really matter where the remote side is. You could imagine, for example, that if the DNAT directed traffic off the local host (often the case with Kubernetes nodeports), the connection wouldn't appear established locally regardless of which namespace we looked under.

This sort of leads me to think that the right answer is either to add a mode/option to the iptables module that looks at conntrack so that NAT-only "sessions" can match, or just to bring back the function while explicitly stating that the module supports it for the purposes of keeping NAT sessions functional.
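As a rough illustration of the first option, here is a toy Python model of an opt-in conntrack fallback. It is a sketch only, not kernel code: decide_with_conntrack and the use_conntrack flag are hypothetical names, and how such an option would actually be exposed by the module is not decided here.

```python
def decide_with_conntrack(syn, flow, next_hop, established, conntrack,
                          use_conntrack=False):
    """Toy decision: "accept" locally or "forward:<hop>" to the next hop."""
    if syn or flow in established:
        return "accept"
    # Hypothetical opt-in mode: a conntrack entry (e.g. a DNAT session)
    # counts as a local session even without a local ESTABLISHED socket.
    if use_conntrack and flow in conntrack:
        return "accept"
    return ("forward:" + next_hop) if next_hop else "accept"


flow = ("198.51.100.7", 40000, "203.0.113.10", 443)
conntrack = {flow}  # entry left by the DNAT rule; no local socket exists

print(decide_with_conntrack(False, flow, "10.0.0.2", set(), conntrack))
# forward:10.0.0.2
print(decide_with_conntrack(False, flow, "10.0.0.2", set(), conntrack,
                            use_conntrack=True))
# accept
```

Keeping the fallback behind an explicit flag leaves the default behaviour unchanged while still letting NAT-only sessions match, including ones whose DNAT target is off the local host.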