I'm trying to find a solution but it is not easy:
Scenario 1: single-pod service, accessing the NodePort on the node where the pod is located
To avoid martian packets, disable rp_filter. After this we get
With:
Scenario 2: externalTrafficPolicy: Local
Slightly different, because in that case traffic is not source-NATed on the host (this option is used for performance and to preserve the client source IP).
To avoid martian packets, disable rp_filter. After this we get
I have a workaround for the first scenario:
# Mark incoming NodePort connections (default NodePort range) on the primary interface
iptables -t mangle -A PREROUTING -i ens3 -p tcp --dport 30000:32767 -j CONNMARK --set-mark 42
# Restore the connection mark on packets coming back from the pod veths
iptables -t mangle -A PREROUTING -i veth+ -j CONNMARK --restore-mark
# Route marked traffic with the main table so replies leave via the primary interface
ip rule add fwmark 42 lookup main pref 1024
This is generic and only needs to be done once, regardless of the number of ENIs/pods.
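For reference, the "disable rp_filter" step mentioned in both scenarios is a sysctl change; a minimal sketch, assuming ens3 is the primary interface as in the examples here (log_martians is how the drops were observed in the first place):
# Log martian packets so the drops show up in the kernel log
sysctl -w net.ipv4.conf.all.log_martians=1
# The kernel uses the maximum of the "all" and per-interface rp_filter values,
# so both need to be 0 to actually disable the reverse-path check
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.ens3.rp_filter=0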
I have a similar solution for scenario 2 but it is more complicated because it needs to be applied to all pods:
# Per pod: mark connections arriving on veth0 whose source is not the host IP
# (assumed to be NodePort traffic forwarded from the host)
iptables -t mangle -A PREROUTING -i veth0 ! -s 172.30.187.226 -j CONNMARK --set-mark 42
# Restore the connection mark on the reply packets the pod generates
iptables -t mangle -A OUTPUT -j CONNMARK --restore-mark
# Send marked traffic through a dedicated table that routes it back over veth0 to the host
ip rule add fwmark 42 lookup 100 pref 1024
ip route add default via 172.30.187.226 dev veth0 table 100
ip route add 172.30.187.226 dev veth0 scope link table 100
Here 172.30.187.226 is the host IP. This assumes that all traffic arriving from the veth that is not from the host IP is NodePort traffic.
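A quick way to check that the per-pod marking and policy routing are in place, wherever the rules were applied (a sketch; conntrack is only available if conntrack-tools is installed):
# The policy rule created for the mark
ip rule show
# The dedicated table sending marked traffic back to the host over veth0
ip route show table 100
# Packet counters on the mangle rules (non-zero once NodePort traffic has flowed)
iptables -t mangle -L -n -v
# Connections currently carrying the mark
conntrack -L --mark 42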
Both solutions work but add a lot of complexity. I hope we can find a simpler solution.
NodePort is not something in our test matrix, as we either do direct pod routing with Envoy or, in a few cases, use an Ingress controller on a fixed pool (obviously not for huge traffic). If you do not need security groups, we have experimented with the Network Load Balancer's IP-target functionality (you can attach VPC IPs as endpoints), using a control loop running in the cluster that looks for annotations and binds pod IPs to the NLB. This bypasses host networking and kube-proxy rules entirely. I don't have any code to share for this controller at hand, but it may be a higher-performance path (excepting the issue of security groups not being supported on NLBs).
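As a rough illustration of that IP-target idea (not the actual controller; the target group ARN, pod IP, and port below are placeholders), the control loop would essentially drive calls like:
# Register a pod's VPC IP directly as an NLB target (the target group must use target type "ip")
aws elbv2 register-targets \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example/0123456789abcdef \
  --targets Id=10.0.1.23,Port=8080
# Deregister it when the pod goes away
aws elbv2 deregister-targets \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example/0123456789abcdef \
  --targets Id=10.0.1.23,Port=8080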
Let me look over the proposed changes to see if I can find a simpler route.
Thank you for the answer. I figured you were not using load-balancer services, otherwise you would have encountered the issue before. The idea of a custom controller to bind pod IPs to an NLB directly is very interesting (this is something we have also started to discuss on our side). Do you plan to open-source the controller?
Is it possible that https://github.com/lyft/cni-ipvlan-vpc-k8s/pull/42 and running on a single subnet cause this behavior? @theatrus tried to reproduce this issue using our multi-subnet setup but was unable to do so earlier this week. I still haven't had time to dig into this issue, but as soon as the RC is out, I want to tackle this.
Thank you for looking into it
I don't see why having separate subnets would solve this, because the routing asymmetry is between the primary host interface (where NodePort traffic comes in) and the pod veth (the outgoing interface for traffic coming from a pod with a destination in the pod CIDR range).
Let me know if you need any help reproducing it
Fixed by #44
When setting up a NodePort, if we access a host where a pod backing the target service is running and load-balancing chooses the local pod, the traffic is dropped.
Everything seems to work because when the first SYN is dropped the client retries and will (probably) be sent to another host; however, queries initially load-balanced to the local pod take much longer.
This can be seen by logging martian packets. When traffic is sent to a local pod it will be dropped with the following log:
To trigger the issue I simply did this until the answer took more than 1s:
where 172.30.183.34 is the host IP and 30054 the NodePort. The kube-proxy NodePort iptables PREROUTING rules redirected traffic to 172.30.182.212 (the local pod for the service), which triggered the martian log. Looking at the routing explains the issue:
This means that traffic arrives on ens3 but the reverse route goes through the pod interface (the route back to the pod is necessary to access services).
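For anyone reproducing this, inspecting the routing state on the node makes the asymmetry visible; a sketch using the pod IP from the log above:
# Route the host uses to reach the local pod (the forward path for the DNATed NodePort traffic)
ip route get 172.30.182.212
# Policy routing rules on the host
ip rule show
# Every table that mentions the pod IP
ip route show table all | grep -F 172.30.182.212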
To trigger the issue consistently (100% of the time) we just need to add externalTrafficPolicy: Local to the service definition (or scale the service down to 1 pod).
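A hypothetical way to reproduce, using the host IP and NodePort from above (the service and deployment names and the curl loop are illustrative, not the exact commands used originally):
# Force NodePort traffic on a node to its local pod
kubectl patch svc my-service -p '{"spec":{"externalTrafficPolicy":"Local"}}'
# ...or reduce the service to a single backend
kubectl scale deployment my-deployment --replicas=1
# Hit the NodePort on the node hosting the pod and watch for slow (retried) requests
while true; do
  curl -s -o /dev/null -w '%{time_total}\n' http://172.30.183.34:30054/
  sleep 1
done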