lyft / cni-ipvlan-vpc-k8s

AWS VPC Kubernetes CNI driver using IPvlan
Apache License 2.0

Nodeports not working properly #38

Closed lbernail closed 5 years ago

lbernail commented 6 years ago

When setting up a NodePort, if we access a node where a pod backing the target service is running and kube-proxy load-balancing picks that local pod, the traffic is dropped.

Everything seems to work OK because, if the first SYN is dropped, the client retries and the retry will (probably) be load-balanced to another host; however, requests load-balanced to the local pod take much longer.

This can be seen by logging martian packets. When traffic is sent to a local pod it will be dropped with the following log:

[912228.409488] IPv4: martian source 172.30.182.212 from 172.21.51.75, on dev ens3
[912228.409534] ll header: 00000000: 0e d8 07 a0 c0 0c 0e f0 4b 50 fd 5c 08 00        ........KP.\..
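For reference, this is roughly how martian-packet logging can be enabled on a node to see these drops (standard Linux sysctls, nothing specific to this plugin):

# enable martian logging on all interfaces; messages go to the kernel log
sysctl -w net.ipv4.conf.all.log_martians=1
sysctl -w net.ipv4.conf.default.log_martians=1
# watch for "martian source" entries while reproducing the issue
dmesg -wT | grep -i martian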

To trigger the issue I simply did this until the answer took more than 1s:

$ curl http://172.30.183.34:30054

where 172.30.183.34 is the host IP and 30054 the NodePort. The kube-proxy NodePort iptables PREROUTING rules redirected the traffic to 172.30.182.212 (the local pod for the service), which triggered the martian log.
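A rough repro loop along those lines (host IP and NodePort as above; the 1s threshold simply flags the SYN retransmit):

# keep hitting the NodePort until a request takes more than 1s,
# i.e. the first SYN was dropped and the client had to retransmit
while true; do
  t=$(curl -o /dev/null -s -w '%{time_total}' http://172.30.183.34:30054)
  echo "request took ${t}s"
  awk -v t="$t" 'BEGIN { exit !(t > 1) }' && break
done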

Looking at routing explains the issue:

$ ip route get 172.30.182.212 from 172.21.51.75 iif ens3
RTNETLINK answers: Invalid cross-device link

$ ip route get 172.30.182.212
172.30.182.212 dev veth3b59a300  src 172.30.183.34

$ ip route get 172.21.51.75 from 172.30.182.212 iif veth3b59a300
172.21.51.75 from 172.30.182.212 via 172.30.182.212 dev veth3b59a300

This means that traffic arrives on ens3 but the reverse route goes through the pod veth (the host route to the pod via the veth is required for the pod to reach services).
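The drop itself comes from the kernel's reverse-path filter rejecting this asymmetry; the current setting can be checked with, for example:

# rp_filter=1 (strict) turns the routing asymmetry into a martian drop;
# the kernel applies the highest of the "all" and per-interface values
sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.ens3.rp_filter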

To trigger the issue consistently (100% of the time), we just need to add externalTrafficPolicy: Local to the service definition (or scale the service down to a single pod).
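For example, this hypothetical patch (service name is a placeholder) makes kube-proxy use only local endpoints, so every request hits the local pod:

# force local-only endpoints so the dropped-traffic path is taken on every request
kubectl patch service my-service -p '{"spec":{"externalTrafficPolicy":"Local"}}'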

lbernail commented 6 years ago

I'm trying to find a solution but it is not easy:

Scenario 1: single-pod service, accessing the NodePort on the node where the pod is running

To avoid martian packets, disable rp_filter (see the sketch after this list). After this we get:

  1. received on main interface SIP:SPORT => NIP:NPORT
  2. after prerouting: SIP:SPORT => PIP:PPORT
  3. routed to pod veth
  4. postrouting: NIP:XXXX => PIP:PPORT
  5. received in pod
  6. answer goes back on the veth with PIP:PPORT => NIP:XXXX
  7. after prerouting (reverse NAT): PIP:PPORT => SIP:SPORT
  8. routing fails because the traffic would be sent back out the veth interface (it matches the veth rule for destinations inside the VPC)
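A minimal sketch of the rp_filter change mentioned above (ens3 being the primary interface from the example; loose mode is used here, fully disabling it would also work):

# relax the reverse-path filter so the asymmetric return path is no longer a martian
# (0 = disabled, 2 = loose mode; the kernel applies the highest of "all" and per-interface)
sysctl -w net.ipv4.conf.all.rp_filter=2
sysctl -w net.ipv4.conf.ens3.rp_filter=2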


Scenario 2: externalTrafficPolicy: Local. This is slightly different because in that case traffic is not source-NATted on the host (this option is used for performance and to preserve the client source IP).

To avoid martian packets, disable rp_filter (as in scenario 1). After this we get:

  1. received on main interface SIP:SPORT => NIP:NPORT
  2. after prerouting: SIP:SPORT => PIP:PPORT
  3. routed to pod veth
  4. received in pod
  5. answer goes back out eth0 with PIP:PPORT => SIP:SPORT
  6. traffic is dropped by the original source because it expects the answer to come from NIP:NPORT
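This asymmetric reply can be observed from the client side with something like (addresses taken from this issue):

# the SYN goes to NIP:NPORT but the SYN-ACK comes back from the pod IP,
# so the original source ignores it, since it expects the answer from NIP:NPORT
tcpdump -ni any host 172.30.183.34 or host 172.30.182.212
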
lbernail commented 6 years ago

I have a workaround for the first scenario:

  1. Mark packets received on main interface and sent to a nodeport
  2. Use conntrack to restore the same mark on packets in the reverse direction
  3. Create an ip rule with higher priority to force the return traffic through the main routing table
iptables -t mangle -A PREROUTING -i ens3 -p tcp --dport 30000:32767 -j CONNMARK --set-mark 42
iptables -t mangle -A PREROUTING -i veth+ -j CONNMARK --restore-mark
ip rule add fwmark 42 lookup main pref 1024

This is generic and only needs to be done once, regardless of the number of ENIs/pods.
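A quick way to verify the workaround is in place on a node (assuming the three commands above have been applied):

# the fwmark shows up as hex (42 = 0x2a) in the rule listing
ip rule show
# packet counters on the CONNMARK rules confirm NodePort traffic is being marked
iptables -t mangle -L PREROUTING -n -v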

lbernail commented 6 years ago

I have a similar solution for scenario 2 but it is more complicated because it needs to be applied to all pods:

  1. Mark packets received through veth and not from host
  2. Use conntrack to restore the same mark on packets in the reverse direction
  3. Create a higher-priority rule for this mark that forces return traffic through the veth interface, using a dedicated routing table
iptables -t mangle -A PREROUTING -i veth0 ! -s 172.30.187.226 -j CONNMARK --set-mark 42
iptables -t mangle -A OUTPUT -j CONNMARK --restore-mark
ip rule add fwmark 42 lookup 100 pref 1024
ip route add default via 172.30.187.226 dev veth0 table 100
ip route add 172.30.187.226 dev veth0 scope link table 100

Where 172.30.187.226 is the host IP.

This assumes that all traffic coming from the veth that does not originate from the host IP is NodePort traffic.
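A corresponding spot check, presumably run where these per-pod rules were added (veth0 / 172.30.187.226 as above):

# the fwmark rule should point at table 100, which routes everything
# back through the veth toward the host
ip rule show | grep 0x2a
ip route show table 100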

Both solutions work but add a lot of complexity. I hope we can find a simpler one.

theatrus commented 6 years ago

NodePort is not something on our test matrix, as we either do direct pod routing with Envoy or, in a few cases, use an Ingress controller on a fixed pool (obviously not for huge traffic). If you do not need security groups, we have experimented with the Network Load Balancer's IP-targeting functionality (you can attach VPC IPs as endpoints), using a control loop running in the cluster that looks for annotations to bind to the NLB. This bypasses host networking and kube-proxy rules entirely. I don't have any code to share for this controller at hand, but it may be a higher-performance path (excepting the issue of security groups not being supported on NLBs).
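For illustration, a hypothetical manual equivalent of what such a control loop would do with the AWS CLI (target group ARN, pod IP, and port are placeholders; the target group has to be created with target-type ip):

# register a pod's VPC IP directly as an NLB target, bypassing NodePorts entirely
aws elbv2 register-targets \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/pod-targets/0123456789abcdef \
  --targets Id=172.30.182.212,Port=8080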

Let me look over the proposed changes to see if I can find a simpler route.

lbernail commented 6 years ago

Thank you for the answer. I figured you were not using load-balancer services, otherwise you would have encountered the issue before. The idea of a custom controller to bind pod IPs to an NLB directly is very interesting (this is something we have also started to discuss on our side). Do you plan to open-source the controller?

paulnivin commented 6 years ago

Is it possible that https://github.com/lyft/cni-ipvlan-vpc-k8s/pull/42 combined with running on a single subnet causes this behavior? @theatrus tried to reproduce this issue using our multi-subnet setup but was unable to do so earlier this week. I still haven't had time to dig into this issue, but as soon as the rc is out, I want to tackle this.

lbernail commented 6 years ago

Thank you for looking into it.

I don't see why having separate subnets would solve this: the routing asymmetry is between the primary host interface (incoming traffic on the NodePort) and the pod veth (the outgoing interface for traffic coming from a pod with a destination in the pod CIDR range).

Let me know if you need any help reproducing it.

lbernail commented 5 years ago

Fixed by #44