projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Connections to Kubernetes Services with no endpoints hang #1055

Closed chino closed 4 years ago

chino commented 7 years ago

When a client connects/sends to a Service with no endpoints, the traffic appears to be incorrectly forwarded to the default gateway. This creates a conntrack entry that blackholes the client even after endpoints are later added. TCP will recover from a failed handshake, but UDP clients can keep the conntrack entry alive forever if they keep sending packets from the same source port.
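For anyone hitting this, a quick way to see the bad entry from the node (just a sketch, using the service port 4000 from the repro manifests below) is:

# on the node, look for the stale entry: the reply tuple points at the default
# gateway rather than a pod IP and stays [UNREPLIED]
sudo grep -w 4000 /proc/net/nf_conntrack | grep UNREPLIED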

I believe it should normally be REJECTed by the KUBE-SERVICES chain:

Chain KUBE-SERVICES (2 references)
pkts bytes target     prot opt in     out     source               destination
    0     0 REJECT     udp  --  *      *       0.0.0.0/0            10.100.0.198         /* default/udp-service: has no endpoints */ udp dpt:4000 reject-with icmp-port-unreachable

However it appears that the cali-FORWARD chain handles the packet first (notice the counters):

core@k8s-node-01 ~ $ sudo iptables -L FORWARD -n -v
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
 478K  704M cali-FORWARD  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:wUHhoiAYhphO9Mso */
...
    0     0 KUBE-SERVICES  all  --  *      *       0.0.0.0/0            0.0.0.0/0

There is some UDP conntrack-clearing code in kube-proxy, but it only runs when endpoints are removed; I didn't see anything that would reconcile broken conntrack entries that may exist for other reasons.
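As a stopgap, the stale entries can be cleared by hand once endpoints exist. This is just a sketch, and it assumes the conntrack CLI from conntrack-tools is installed on the node:

# delete UDP conntrack entries whose original destination is the service VIP
# (10.100.0.161 in the repro below) so the next packet is re-evaluated by
# iptables and DNATed to the real endpoint
sudo conntrack -D -p udp --orig-dst 10.100.0.161 --orig-port-dst 4000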

Steps to Reproduce (for bugs)

My Kubernetes setup was based on Calico's Vagrant tutorial:

Example k8s manifests:

kind: Service
apiVersion: v1
metadata:
  name: udp-service
spec:
  selector:
    type: server
  ports:
    - protocol: UDP
      port: 4000
---
apiVersion: v1
kind: Pod
metadata:
  name: udp-client
spec:
  containers:
  - name: udp-client-broken
    image: "alpine"
    # this client will hit the race condition
    # ssh to the host and run this from host namespace:
    #   - grep -w 4000 /proc/net/nf_conntrack
    command: ["/bin/sh", "-c", "while true; do echo client-1 $(date +%s); sleep 1; done | nc -v -u -p 2000 udp-service 4000"]
  - name: udp-client-should-work
    image: "alpine"
    # the initial sleep here gives the server time to get into the endpoint list / iptables
    command: ["/bin/sh", "-c", "echo waiting; sleep 60; echo starting; while sleep 1; do echo client-2 $(date +%s); done | nc -v -u -p 3000 udp-service 4000"]
---
apiVersion: v1
kind: Pod
metadata:
  name: udp-service
  labels:
    type: server
spec:
  containers:
  - name: udp-server
    image: "ruby"
    # using ruby to listen to multiple senders
    command: ["/bin/sh", "-c", "ruby -r socket -e 's = UDPSocket.new; s.bind(\"0.0.0.0\",4000); loop { m, c = s.recvfrom(4096); puts m; $stdout.flush; }'"]

Conntrack shows the following:

ipv4     2 udp      17 29 src=192.168.44.221 dst=10.100.0.161 sport=2000 dport=4000 [UNREPLIED] src=10.100.0.161 dst=10.0.2.15 sport=4000 dport=2000 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 udp      17 29 src=192.168.44.221 dst=10.100.0.161 sport=3000 dport=4000 [UNREPLIED] src=192.168.44.222 dst=192.168.44.221 sport=4000 dport=3000 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2

Other info:

NAME          READY     STATUS    RESTARTS   AGE       IP               NODE            LABELS
udp-client    3/3       Running   0          7m        192.168.44.221   172.18.18.103   <none>
udp-service   1/1       Running   0          7m        192.168.44.222   172.18.18.103   type=server

NAME          CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
udp-service   10.100.0.161   <none>        4000/UDP   7m

NAME          ENDPOINTS             AGE
udp-service   192.168.44.222:4000   7m
core@k8s-node-02 ~ $ ip route | grep default
default via 10.0.2.2 dev eth0  proto dhcp  src 10.0.2.15  metric 1024
core@k8s-node-02 ~ $ ip a s eth0 | grep -w inet
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic eth0
fasaxc commented 7 years ago

@chino Agreed, it looks like we're not playing nice with kube-proxy's rules there, but I think k8s now has a workaround for the blackhole part of this:

See:

Can you check whether the conntrack entries get cleaned up when you add an endpoint to the service?

chino commented 7 years ago

My example manifest above has endpoints that appear later but it still wasn't working.

The calico tutorials use v1.7.0

I believe the changes should be in v1.7.1+

I tried with v1.7.4 but still ran into the issue

Which led me to find that the conntrack tool that kube-proxy calls isn't available in CoreOS:

Aug 24 23:51:20 k8s-node-02 kube-proxy[1452]: E0824 23:51:20.362870    1452 conntrack.go:42] conntrack returned error: error looking for path of conntrack: exec: "conntrack": executable file not found in $PATH
fasaxc commented 7 years ago

I split the vagrant part of this out into https://github.com/projectcalico/calico/issues/1058

chino commented 7 years ago

I can confirm the issue is fixed in k8s v1.7.4 when endpoints are added. I added notes explaining how to fix up the Vagrant tutorials to #1058.

It appears the conversation about sending REJECTs via the FORWARD chain is happening over at: https://github.com/kubernetes/kubernetes/issues/48719

carsonoid commented 7 years ago

I see two options here:

  1. Put the cali-FORWARD rule after the KUBE-SERVICES rule in the FORWARD chain of the filter table.

Excerpt from the table as it is now:

# iptables -n -t filter --line-numbers -L FORWARD
Chain FORWARD (policy ACCEPT)
num  target     prot opt source               destination         
1    cali-FORWARD  all  --  0.0.0.0/0            0.0.0.0/0            /* cali:wUHhoiAYhphO9Mso */
2    KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0           
...

If the KUBE-SERVICES rule comes first then the standard reject rules for services without endpoints will get hit.

  2. Don't explicitly ACCEPT in the cali-FORWARD chain, so that the rest of the FORWARD rules are processed as normal (i.e. don't add rule #4 below).

 iptables -n -t filter --line-numbers -L cali-FORWARD 
Chain cali-FORWARD (1 references)
num  target     prot opt source               destination         
1    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            /* cali:jxvuJjmmRV135nVu */ mark match 0x1000000/0x1000000 ctstate UNTRACKED
2    cali-from-wl-dispatch  all  --  0.0.0.0/0            0.0.0.0/0            /* cali:nu_3aWP3DUkeeFF6 */
3    cali-to-wl-dispatch  all  --  0.0.0.0/0            0.0.0.0/0            /* cali:DjrV_uMYqr-g4joA */
4    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            /* cali:Hl34eZwIcbzmic3y */
...

The packet will then fall back to the FORWARD chain and eventually get rejected in KUBE-SERVICES if there are no endpoints.


I've tested both options by disabling Felix and removing/updating the rules manually (roughly as sketched below). Both resulted in the proper rejection when services had no endpoints.
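For reference, the manual surgery was along these lines (just a sketch: the rule numbers come from the excerpts above and will differ per node, and Felix has to be stopped first or it will put its rules back):

# option 1: move the cali-FORWARD jump below KUBE-SERVICES
iptables -t filter -D FORWARD 1
iptables -t filter -I FORWARD 2 -j cali-FORWARD -m comment --comment "cali:wUHhoiAYhphO9Mso"

# option 2: delete the unconditional ACCEPT at the end of cali-FORWARD (rule #4 above)
# so packets fall through to the rest of the FORWARD chain, including KUBE-SERVICES
iptables -t filter -D cali-FORWARD 4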

fasaxc commented 7 years ago

@carsonoid I believe we have a config option to change the ACCEPT to a RETURN, see IptablesAllowAction (FELIX_IPTABLESALLOWACTION) in https://docs.projectcalico.org/v2.5/reference/felix/configuration
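Something like this should set it on a typical self-hosted install (just a sketch; the DaemonSet name/namespace and the exact variable name depend on which Calico version is deployed and how):

# set the Felix allow action to RETURN on the calico-node DaemonSet and roll it out
kubectl -n kube-system set env daemonset/calico-node FELIX_IPTABLESALLOWACTION=Return
kubectl -n kube-system rollout status daemonset/calico-node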

caseydavenport commented 6 years ago

xref k8s PR: https://github.com/kubernetes/kubernetes/issues/60124#issuecomment-369322795

I suggested using RETURN instead of ACCEPT in that case as well. This relies on kube-proxy accepting the traffic when there are endpoints, which it should do if --cluster-cidr is set.
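(A quick way to check the kube-proxy side of that, as a sketch; the ConfigMap location is what kubeadm-style installs use:)

# check the flag on the running process
ps -ef | grep '[k]ube-proxy'
# or, on kubeadm-style clusters, check the kube-proxy ConfigMap
kubectl -n kube-system get configmap kube-proxy -o yaml | grep -i clustercidr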

chino commented 6 years ago

Hm, I guess this still hasn't moved forward?

tuminoid commented 6 years ago

Using Calico 3.0.6 with k8s 1.10.4, I've tried setting FELIX_IPTABLESFILTERALLOWACTION, FELIX_IPTABLESMANGLEALLOWACTION and FELIX_DEFAULTENDPOINTTOHOSTACTION to RETURN, and it makes no difference. Connections to services with no endpoints still hang instead of being refused.

CALICO_IPV4POOL_CIDR matches with kube-proxy's --cluster-cidr as well.

Reading the comments in #1058, and considering I'm using the official kube container images (which contain nothing but a single Go binary), could it be that kube-proxy does not have access to conntrack?
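One way to check both things (whether Felix actually switched to RETURN, and whether kube-proxy can find conntrack), just a sketch assuming kube-proxy runs as pods labelled k8s-app=kube-proxy in kube-system:

# the allow rules in cali-FORWARD should now use RETURN rather than ACCEPT
sudo iptables -t filter -S cali-FORWARD | tail -n 5
# look for the "executable file not found" conntrack error quoted earlier in this issue
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=200 | grep -i conntrack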

fasaxc commented 6 years ago

@tuminoid Yes, that's possible. I'm also not sure that kube-proxy ever got a fix for this.

tuminoid commented 6 years ago

Quick update: I rebuilt our k8s containers with k8s 1.11.2 and added conntrack for kube-proxy; no change in behavior. Connections still hang.

tuminoid commented 5 years ago

k8s has merged kubernetes/kubernetes#72534 into kube-proxy for the upcoming 1.14 release, which should fix this issue.

hakman commented 4 years ago

We started adding e2e tests for network plugins for kubernetes/kops and noticed that this is still an issue. Any plans for a fix? https://testgrid.k8s.io/sig-cluster-lifecycle-kops#kops-aws-cni-calico

caseydavenport commented 4 years ago

@hakman thanks for poking this one. There hasn't been much activity in this area. I think it makes sense to fix, but I don't currently have a timeline / priority for it. Anyone interested in proposing and submitting a patch? Would love to review it :heart:

caseydavenport commented 4 years ago

xposting here since I believe this is the same issue: https://github.com/projectcalico/calico/issues/3548

caseydavenport commented 1 year ago

For posterity: https://github.com/projectcalico/felix/pull/2424