projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Kubelet can't reach kube-apiserver via service IP #9045

Closed taha-adel closed 1 month ago

taha-adel commented 1 month ago

Cluster Description

I have a three-node Kubernetes cluster deployed via the Kubespray utility. All control plane components are deployed in the node network via the hostNetwork: true parameter. Here is some information about the cluster networking:

Issue Description

When I run any pod, Calico assigns an IP address to the pod and the kubelet tries to reach the kube-apiserver via its service IP, but the connection times out, keeping the pod in the ContainerCreating state. I used tcpdump to check the packets between the kubelet and the kube-apiserver, and I found that all packets are SNATed to the service IP, as shown below:

12:52:27.781606 IP 10.233.0.1.45208 > 10.0.55.68.6443: Flags [S], seq 3350995251, win 65495, options [mss 65495,sackOK,TS val 1474248146 ecr 0,nop,wscale 7], length 0
E..<..@.?...
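For reference, a capture along these lines can be taken on the node with something like the command below (the interface and filter are just an example, not necessarily the exact invocation used here):

$ tcpdump -ni any 'host 10.233.0.1 or port 6443'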

Expected Behavior

The destination IP is DNATed to the IP of the service endpoint and the source IP is kept as the node IP.

Current Behavior

The destination IP is DNATed to the IP of the service endpoint, but the source IP is SNATed to the service IP.

Possible Solution

Steps to Reproduce (for bugs)

  1. Get onto one of the master nodes.
  2. Initiate a telnet request to 10.233.0.1 443, as shown below.
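For example, from node1 (nc works as well as telnet; the loop just makes the round-robin pattern easier to see - these commands are only a sketch):

$ telnet 10.233.0.1 443
$ for i in 1 2 3 4 5 6; do nc -z -w 3 10.233.0.1 443 && echo "attempt $i: connected" || echo "attempt $i: timed out"; done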

Context

Your Environment

caseydavenport commented 1 month ago

Calico version: v3.22.2

This is a very old version of Calico that is no longer in support. I recommend upgrading to a modern version.

caseydavenport commented 1 month ago

12:52:27.781606 IP 10.233.0.1.45208 > 10.0.55.68.6443: Flags [S], seq 3350995251, win 65495, options [mss 65495,sackOK,TS val 1474248146 ecr 0,nop,wscale 7], length 0

Could you provide a bit more of the tcpdump output as well as some more information about what each IP address belongs to? e.g., what is 10.0.55.68?

taha-adel commented 1 month ago

Thank you @caseydavenport for your reply.

Let me explain further. We have three master nodes running kube-apiserver with the following IPs:

NAME    STATUS   ROLES                  AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
node1   Ready    control-plane,master   2y67d   v1.23.6   10.0.55.68    <none>        Ubuntu 18.04.6 LTS   4.15.0-213-generic   containerd://1.6.4
node2   Ready    control-plane,master   2y67d   v1.23.6   10.0.55.69    <none>        Ubuntu 18.04.6 LTS   4.15.0-213-generic   containerd://1.6.4
node3   Ready    control-plane,master   2y67d   v1.23.6   10.0.55.70    <none>        Ubuntu 18.04.6 LTS   4.15.0-213-generic   containerd://1.6.4

and here is all the information related to the kube-apiserver service:

NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.233.0.1   <none>        443/TCP   2y67d

NAME         ENDPOINTS                                         AGE
kubernetes   10.0.55.68:6443,10.0.55.69:6443,10.0.55.70:6443   2y67d

The issue only occurs when I try to reach the kube-apiserver via its service IP from one of the nodes.

When I try to initiate a telnet 10.233.0.1 443 from node1, I get a successful response every third try (the first two attempts time out and the third connects). The successful attempt is the one that gets routed to node1.

Actually, I'm not sure what more information I can get from tcpdump. The only meaningful information I can extract is the src IP/port and dst IP/port. Anyway, here are more packet captures:

06:16:50.972450 IP 10.233.0.1.35700 > 10.0.55.69.6443: Flags [S], seq 4034289792, win 65495, options [mss 65495,sackOK,TS val 1049327718 ecr 0,nop,wscale 7], length 0
E..<..@.?.9.
...
.7E.t.+.v`.........L].........
>.xf........
06:16:51.432410 IP 10.233.0.1.49798 > 10.0.55.68.6443: Flags [S], seq 1006463059, win 65495, options [mss 65495,sackOK,TS val 1536912646 ecr 0,nop,wscale 7], length 0
E..<..@.?...
...
.7D...+;.hS.........2.........
[.m.........
06:16:51.976770 IP 10.233.0.1.35700 > 10.0.55.69.6443: Flags [S], seq 4034289792, win 65495, options [mss 65495,sackOK,TS val 1049328722 ecr 0,nop,wscale 7], length 0
E..<..@.?.9.
...
.7E.t.+.v`.........L].........
>.|R........
06:16:53.311681 IP 10.0.55.68.27432 > 10.0.55.69.6443: Flags [P.], seq 218:264, ack 37214, win 1513, options [nop,nop,TS val 2479777417 ecr 3120908656], length 46
E..bR.@.?.f}
.7D
.7Ek(.+.>.?;V.............
..f...Ip....).............r..p|..F...L.......i,..Y....
06:16:53.311965 IP 10.0.55.69.6443 > 10.0.55.68.27432: Flags [.], ack 264, win 501, options [nop,nop,TS val 3120912778 ecr 2479777417], length 0
E..4..@.@...
.7E
.7D.+k(;V...>.m.....k.....
..Y...f.
06:16:53.314658 IP 10.0.55.69.6443 > 10.0.55.68.27432: Flags [P.], seq 37214:37310, ack 264, win 501, options [nop,nop,TS val 3120912781 ecr 2479777417], length 96
E.....@.@..W
.7E
.7D.+k(;V...>.m...........
..Y...f.....[.....!...A...y.J..........R.H.....o.........B............I{.....2./..q.;M...u...F..L......3
06:16:53.314680 IP 10.0.55.69.6443 > 10.0.55.68.27432: Flags [P.], seq 37310:37674, ack 264, win 501, options [nop,nop,TS val 3120912781 ecr 2479777417], length 364
E.....@.@..J
.7E
.7D.+k(;V...>.m...........
..Y...f.....g.....!....`^....-@....Ja}<.+...[...@......7[..>..}.B1..    D.u.6...&PK...{....}.].q.....?2".......tg.1..[....5...!?.(..*L...L..N..0Z.PM.=....@..%.n..Y..F..`O.",.).O(...).l.~...h.Ho%...&.+M.o.Zp?..S.........H..z.....L..]...v....a..j.......~.j.....$.
caseydavenport commented 1 month ago

I get a successful response every third try (the first two attempts time out and the third connects).

It feels relevant that there are three hosts running the apiserver and you get a response every third try. That sounds like round-robin load balancing, with only the requests forwarded to the local API server succeeding.

Are you by chance using the kube-proxy in IPVS mode?

To figure out where the NAT might be occurring, I'd look at the output of iptables-save -c on node1 while sending traffic and check for incrementing counters on rules with SNAT or MASQUERADE actions. Note that if you're using IPVS kube-proxy, this will be less effective since the load balancing will be performed in IPVS instead.
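A rough sketch of that check (rule and chain names will differ per cluster; this just diffs the NAT table counters before and after generating traffic):

$ iptables-save -c -t nat > /tmp/nat-before
$ for i in 1 2 3; do nc -z -w 3 10.233.0.1 443; done
$ iptables-save -c -t nat > /tmp/nat-after
$ diff /tmp/nat-before /tmp/nat-after | grep -E 'SNAT|MASQUERADE'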

taha-adel commented 1 month ago

@caseydavenport Yes, I'm using kube-proxy in IPVS mode.

$ ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.233.0.1:443 rr
  -> 10.0.55.68:6443              Masq    1      1          0         
  -> 10.0.55.69:6443              Masq    1      6          0         
  -> 10.0.55.70:6443              Masq    1      1          0         
caseydavenport commented 1 month ago

I'm afraid I'm not much of an expert on the IPVS kube-proxy, but it looks like the NAT you're seeing is happening in IPVS, based on the Masq forwarding method shown in the output you provided. Perhaps there's an IPVS proxy configuration option to turn off that masquerade?
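One place to check, though I'm not certain it applies here, is the kube-proxy configuration for masquerade-related settings (the ConfigMap name may differ depending on how the cluster was deployed):

$ kubectl -n kube-system get configmap kube-proxy -o yaml | grep -iE 'mode|masquerade|excludeCIDRs'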

That said, I'm not sure the MASQ is necessarily a problem - it could just be a cross-node connectivity issue (e.g., security group configuration) blocking one node from talking to the others.
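A quick way to rule that out is to test direct node-to-node connectivity to the other apiserver endpoints from node1, e.g.:

$ nc -vz -w 3 10.0.55.69 6443
$ nc -vz -w 3 10.0.55.70 6443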

Generally I advise against using IPVS unless it's absolutely necessary - it doesn't have a lot of support upstream in k8s these days.

taha-adel commented 1 month ago

Issue resolved after restarting the nodes.