projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

externalTrafficPolicy=Local not working as expected with ebpf dataplane and kube-vip as LB #9330

Open jaxklag opened 5 days ago

jaxklag commented 5 days ago

Hello,

I'm facing a strange problem on a fresh install of Calico with the eBPF dataplane.

Maybe I've missed something in the configuration of the eBPF dataplane...

I'm including all the info I have, and I'm available to provide more traces if needed.

Thanks for your help!

Expected Behavior

with "externalTrafficPolicy: Cluster" set on service, source IP must be SNAT with ip of k8s node.

with "externalTrafficPolicy: Local" set on service, source IP must be preserved and most of all, service must be reachable from outside the cluster.

Current Behavior

~~When "externalTrafficPolicy: Cluster" set on service, source IP is preserved and not replaced with those of k8s node. Customer can reach external IP of the service from outside the k8s cluster.~~

When "externalTrafficPolicy: Local" set on service, source IP. Customer can't reach external IP of the service. The customer network traffc ends on k8s node which have the external IP.

Steps to Reproduce (for bugs)

  1. Create a basic nginx deployment with replicas=2 (the replica count has no bearing on the problem; for example, the problem also happens with replicas=1)
$ kubectl -n jax get deployments.apps nginx-jax -oyaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx-jax
  name: nginx-jax
  namespace: jax
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nginx-jax
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nginx-jax
    spec:
      containers:
      - image: nginx:latest
        imagePullPolicy: Always
        name: nginx-jax
        ports:
        - containerPort: 80
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
  2. Create a k8s LoadBalancer service with externalTrafficPolicy set to Local
$ kubectl -n jax get svc nginx-jax -oyaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: nginx-jax
    implementation: kube-vip
  name: nginx-jax
  namespace: jax
spec:
  allocateLoadBalancerNodePorts: false
  clusterIP: 10.233.43.116
  clusterIPs:
  - 10.233.43.116
  externalTrafficPolicy: Local
  healthCheckNodePort: 30726
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  loadBalancerIP: 10.128.41.230
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx-jax
  sessionAffinity: None
  type: LoadBalancer
  3. Curl the external service IP in Local mode. The curl is done from outside the k8s cluster; client IP = 192.168.221.229. The curl times out and the traffic stops on the worker node that has the external IP configured on it.
$ curl http://10.128.41.230/
curl: (7) Failed to connect to 10.128.41.230 port 80: Connection refused

On the k8s node named "k8sw1", with the external IP configured on interface ens192 (10.128.41.230):

$ ip a s dev ens192
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:b5:4a:a0 brd ff:ff:ff:ff:ff:ff
    altname enp11s0
    inet 10.128.41.211/26 brd 10.128.41.255 scope global noprefixroute ens192   <=== MAIN IP
       valid_lft forever preferred_lft forever
    inet 10.128.41.234/32 scope global ens192
       valid_lft forever preferred_lft forever
    inet 10.128.41.232/32 scope global ens192
       valid_lft forever preferred_lft forever
    inet 10.128.41.231/32 scope global ens192
       valid_lft forever preferred_lft forever
    inet 10.128.41.233/32 scope global ens192
       valid_lft forever preferred_lft forever
    inet 10.128.41.230/32 scope global ens192  <===== LOADBALANCER SERVICE IP
       valid_lft forever preferred_lft forever
    inet6 fe80::3b6:5392:c340:9c45/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

We don't see any DNAT in the tcpdump output on any of the worker nodes:

$ tcpdump -nni any host 192.168.221.229 and port 80
tcpdump: data link type LINUX_SLL2
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes

15:31:10.220141 ens192 In  IP 192.168.221.229.47456 > 10.128.41.230.80: Flags [S], seq 541190242, win 29200, options [mss 1460,sackOK,TS val 3539896342 ecr 0,nop,wscale 7], length 0
^C
1 packet captured
2 packets received by filter
0 packets dropped by kernel

NAT rule from calico-node on the "k8sw1" node. The pod IPs in the NAT rule are correct; I don't know whether the rule looks as expected.

$ kubectl get pods -owide -n kube-system | grep calico-node | grep k8sw1
calico-node-c29m8                           1/1     Running   0              89m   10.128.41.211   tstt9-d1-co-k8sw1   <none>           <none>

$ kubectl exec -n kube-system calico-node-c29m8 -- calico-node -bpf nat dump |grep -b3 "10.128.41.230 port 80"
Defaulted container "calico-node" out of: calico-node, upgrade-ipam (init), install-cni (init), flexvol-driver (init)
4541:10.128.41.230 port 80 proto 6 id 36 count 2 local 0 flags external-local
4614-    36:0     10.233.123.248:80
4639-    36:1     10.233.84.230:80

$ kubectl -n jax get pods -owide
NAME                         READY   STATUS    RESTARTS      AGE   IP               NODE                NOMINATED NODE   READINESS GATES
nginx-jax-7fbcfcdf8f-79fvt   1/1     Running   2 (87m ago)   29h   10.233.84.230    tstt9-d1-co-k8sw3   <none>           <none>
nginx-jax-7fbcfcdf8f-8h6fj   1/1     Running   0             1h    10.233.123.248   tstt9-d1-co-k8sw2   <none>           <none>

This is the first problem: when externalTrafficPolicy is set to Local, it's impossible to reach the service via its external IP.

If I just change externalTrafficPolicy from "Local" to "Cluster", the service becomes reachable. See the debug below.
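(For reference, a quick way to flip the policy without re-applying the whole manifest — just a sketch using a strategic merge patch on the service shown above:)

$ kubectl -n jax patch svc nginx-jax -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'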

  4. Update the k8s LoadBalancer service with externalTrafficPolicy set to Cluster
$ kubectl -n jax get svc nginx-jax -oyaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: nginx-jax
    implementation: kube-vip
  name: nginx-jax
  namespace: jax
spec:
  allocateLoadBalancerNodePorts: false
  clusterIP: 10.233.43.116
  clusterIPs:
  - 10.233.43.116
  externalTrafficPolicy: Cluster
  healthCheckNodePort: 30726
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  loadBalancerIP: 10.128.41.230
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx-jax
  sessionAffinity: None
  type: LoadBalancer
  5. Curl the external service IP in Cluster mode. The curl is done from outside the k8s cluster; client IP = 192.168.221.229.

Curl gets an answer:

$ curl http://10.128.41.230/
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
...
</body>
</html>

On the k8s node named "k8sw1", with the external IP configured on interface ens192 (10.128.41.230):

We observe the HTTP traffic, but we do not see the response packets because of DSR.

$ tcpdump -nni any host 192.168.221.229 and port 80
tcpdump: data link type LINUX_SLL2
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes

15:53:53.390402 ens192 In  IP 192.168.221.229.57802 > 10.128.41.230.80: Flags [S], seq 2997794786, win 29200, options [mss 1460,sackOK,TS val 3541259513 ecr 0,nop,wscale 7], length 0
15:53:53.408214 ens192 In  IP 192.168.221.229.57802 > 10.128.41.230.80: Flags [.], ack 220191716, win 229, options [nop,nop,TS val 3541259531 ecr 2755838583], length 0
15:53:53.408239 ens192 In  IP 192.168.221.229.57802 > 10.128.41.230.80: Flags [P.], seq 0:77, ack 1, win 229, options [nop,nop,TS val 3541259531 ecr 2755838583], length 77: HTTP: GET / HTTP/1.1
15:53:53.426190 ens192 In  IP 192.168.221.229.57802 > 10.128.41.230.80: Flags [.], ack 239, win 237, options [nop,nop,TS val 3541259549 ecr 2755838601], length 0
15:53:53.426213 ens192 In  IP 192.168.221.229.57802 > 10.128.41.230.80: Flags [.], ack 854, win 247, options [nop,nop,TS val 3541259549 ecr 2755838602], length 0
15:53:53.426235 ens192 In  IP 192.168.221.229.57802 > 10.128.41.230.80: Flags [F.], seq 77, ack 854, win 247, options [nop,nop,TS val 3541259549 ecr 2755838602], length 0
15:53:53.444202 ens192 In  IP 192.168.221.229.57802 > 10.128.41.230.80: Flags [.], ack 855, win 247, options [nop,nop,TS val 3541259567 ecr 2755838619], length 0
^C
7 packets captured
8 packets received by filter
0 packets dropped by kernel

I follow the HTTP traffic on the k8s node named "k8sw3", which hosts one of the nginx pods. We observe that the response is sent directly thanks to DSR.

The problem here is that the source IP is not replaced with the IP of the k8sw1 node despite the Cluster mode of the service. Is that normal?

$ tcpdump -nni any host 192.168.221.229 and port 80
tcpdump: data link type LINUX_SLL2
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes

15:53:53.390496 cali41c5c8c6b53 Out IP 192.168.221.229.57802 > 10.233.84.230.80: Flags [S], seq 2997794786, win 29200, options [mss 1460,sackOK,TS val 3541259513 ecr 0,nop,wscale 7], length 0
15:53:53.390524 cali41c5c8c6b53 In  IP 10.233.84.230.80 > 192.168.221.229.57802: Flags [S.], seq 220191715, ack 2997794787, win 31856, options [mss 1460,sackOK,TS val 2755838583 ecr 3541259513,nop,wscale 7], length 0
15:53:53.390539 ens192 Out IP 10.128.41.230.80 > 192.168.221.229.57802: Flags [S.], seq 220191715, ack 2997794787, win 31856, options [mss 1460,sackOK,TS val 2755838583 ecr 3541259513,nop,wscale 7], length 0
15:53:53.408311 cali41c5c8c6b53 Out IP 192.168.221.229.57802 > 10.233.84.230.80: Flags [.], ack 1, win 229, options [nop,nop,TS val 3541259531 ecr 2755838583], length 0
15:53:53.408315 cali41c5c8c6b53 Out IP 192.168.221.229.57802 > 10.233.84.230.80: Flags [P.], seq 1:78, ack 1, win 229, options [nop,nop,TS val 3541259531 ecr 2755838583], length 77: HTTP: GET / HTTP/1.1
15:53:53.408361 cali41c5c8c6b53 In  IP 10.233.84.230.80 > 192.168.221.229.57802: Flags [.], ack 78, win 249, options [nop,nop,TS val 2755838601 ecr 3541259531], length 0
15:53:53.408371 ens192 Out IP 10.128.41.230.80 > 192.168.221.229.57802: Flags [.], ack 78, win 249, options [nop,nop,TS val 2755838601 ecr 3541259531], length 0
15:53:53.408606 cali41c5c8c6b53 In  IP 10.233.84.230.80 > 192.168.221.229.57802: Flags [P.], seq 1:239, ack 78, win 249, options [nop,nop,TS val 2755838601 ecr 3541259531], length 238: HTTP: HTTP/1.1 200 OK
15:53:53.408630 ens192 Out IP 10.128.41.230.80 > 192.168.221.229.57802: Flags [P.], seq 1:239, ack 78, win 249, options [nop,nop,TS val 2755838601 ecr 3541259531], length 238: HTTP: HTTP/1.1 200 OK
15:53:53.408704 cali41c5c8c6b53 In  IP 10.233.84.230.80 > 192.168.221.229.57802: Flags [P.], seq 239:854, ack 78, win 249, options [nop,nop,TS val 2755838602 ecr 3541259531], length 615: HTTP
15:53:53.408708 ens192 Out IP 10.128.41.230.80 > 192.168.221.229.57802: Flags [P.], seq 239:854, ack 78, win 249, options [nop,nop,TS val 2755838602 ecr 3541259531], length 615: HTTP
15:53:53.426260 cali41c5c8c6b53 Out IP 192.168.221.229.57802 > 10.233.84.230.80: Flags [.], ack 239, win 237, options [nop,nop,TS val 3541259549 ecr 2755838601], length 0
15:53:53.426265 cali41c5c8c6b53 Out IP 192.168.221.229.57802 > 10.233.84.230.80: Flags [.], ack 854, win 247, options [nop,nop,TS val 3541259549 ecr 2755838602], length 0
15:53:53.426290 cali41c5c8c6b53 Out IP 192.168.221.229.57802 > 10.233.84.230.80: Flags [F.], seq 78, ack 854, win 247, options [nop,nop,TS val 3541259549 ecr 2755838602], length 0
15:53:53.426338 cali41c5c8c6b53 In  IP 10.233.84.230.80 > 192.168.221.229.57802: Flags [F.], seq 854, ack 79, win 249, options [nop,nop,TS val 2755838619 ecr 3541259549], length 0
15:53:53.426346 ens192 Out IP 10.128.41.230.80 > 192.168.221.229.57802: Flags [F.], seq 854, ack 79, win 249, options [nop,nop,TS val 2755838619 ecr 3541259549], length 0
15:53:53.444275 cali41c5c8c6b53 Out IP 192.168.221.229.57802 > 10.233.84.230.80: Flags [.], ack 855, win 247, options [nop,nop,TS val 3541259567 ecr 2755838619], length 0
^C
17 packets captured
23 packets received by filter
0 packets dropped by kernel

NAT rules from calico-node on the "k8sw1" and "k8sw3" nodes. The pod IPs in the NAT rules are correct; I don't know whether the rules look as expected.

$ kubectl get pods -owide -n kube-system | grep calico-node | egrep "k8sw1|k8sw3"
calico-node-c29m8                           1/1     Running   0               119m   10.128.41.211   tstt9-d1-co-k8sw1   <none>           <none>
calico-node-zlvrk                           1/1     Running   0               109m   10.128.41.213   tstt9-d1-co-k8sw3   <none>           <none>

$ kubectl exec -n kube-system calico-node-c29m8 -- calico-node -bpf nat dump |grep -b3 "10.128.41.230 port 80"
Defaulted container "calico-node" out of: calico-node, upgrade-ipam (init), install-cni (init), flexvol-driver (init)

4624:10.128.41.230 port 80 proto 6 id 37 count 2 local 0
4676-    37:0     10.233.123.248:80
4701-    37:1     10.233.84.230:80

$ kubectl exec -n kube-system calico-node-zlvrk -- calico-node -bpf nat dump |grep -b3 "10.128.41.230 port 80"
Defaulted container "calico-node" out of: calico-node, upgrade-ipam (init), install-cni (init), flexvol-driver (init)

178:10.128.41.230 port 80 proto 6 id 37 count 2 local 1
230-    37:0     10.233.84.230:80
254-    37:1     10.233.123.248:80

$ kubectl -n jax get pods -owide
NAME                         READY   STATUS    RESTARTS      AGE   IP               NODE                NOMINATED NODE   READINESS GATES
nginx-jax-7fbcfcdf8f-79fvt   1/1     Running   2 (87m ago)   29h   10.233.84.230    tstt9-d1-co-k8sw3   <none>           <none>
nginx-jax-7fbcfcdf8f-8h6fj   1/1     Running   0             1h    10.233.123.248   tstt9-d1-co-k8sw2   <none>           <none>

It works, but there is no SNAT of the client source IP despite the Cluster setting for externalTrafficPolicy on the service.

Context

Quite simple.

I just want to have the choice of whether or not to preserve the client source IP with the eBPF dataplane and DSR enabled; in other words, honor the value of externalTrafficPolicy configured on the service.

Your Environment

Calico was set up using Kubespray v2.26.0 with the traditional Linux dataplane.

Following the Calico docs, the eBPF dataplane was then enabled manually, without using Kubespray.
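(For reference, the manual enablement from the docs boils down to something like the following — a sketch only, assuming a manifest-based install where calico-node runs in kube-system; the API server host/port values are placeholders:)

$ kubectl -n kube-system create configmap kubernetes-services-endpoint \
    --from-literal=KUBERNETES_SERVICE_HOST=<apiserver-host> \
    --from-literal=KUBERNETES_SERVICE_PORT=6443
$ kubectl patch felixconfiguration default --type merge -p '{"spec":{"bpfEnabled":true}}'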

The k8s cluster is not a production cluster. OS: Rocky Linux 9 with a supported kernel

$ uname -r
5.14.0-427.37.1.el9_4.x86_64

k8s version: 1.29.6

$ kubectl version
Client Version: v1.30.4
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.6

kube-proxy is disabled, as suggested in the Calico documentation.
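(The docs' suggested way of doing that is roughly the following — a sketch, assuming kube-proxy runs as a DaemonSet in kube-system:)

$ kubectl patch ds -n kube-system kube-proxy -p '{"spec":{"template":{"spec":{"nodeSelector":{"non-calico":"true"}}}}}'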

Calico version: 3.28.1, then upgraded to 3.28.2. Same problem in both versions.

$ kubectl -n calico-apiserver get deployments.apps calico-apiserver -oyaml | grep "image:"
        image: quay.io/calico/apiserver:v3.28.2

$ kubectl -n kube-system get deployments.apps calico-kube-controllers -oyaml | grep "image:"
        image: quay.io/calico/kube-controllers:v3.28.2

$ kubectl -n kube-system get ds calico-node -oyaml | grep "image:"
        image: quay.io/calico/node:v3.28.2
        image: quay.io/calico/cni:v3.28.2
        image: quay.io/calico/cni:v3.28.2
      - image: quay.io/calico/pod2daemon-flexvol:v3.28.2

Calico Bpf External Service Mode: "DSR" ==> same problem with "Tunnel". Note that both modes work as expected regarding the network path used for the response packets.

$ kubectl -n kube-system describe felixconfigurations.projectcalico.org default 
Name:         default
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  projectcalico.org/v3
Kind:         FelixConfiguration
Metadata:
  Creation Timestamp:  2024-09-17T07:19:17Z
  Resource Version:    15999845
  UID:                 85254b13-7f32-44ab-9600-49dcbbc82ab5
Spec:
  Bpf Connect Time Load Balancing:      TCP
  Bpf Disable Unprivileged:             true
  Bpf Enabled:                          true
  Bpf External Service Mode:            DSR
  Bpf Host Networked NAT Without CTLB:  Enabled
  Bpf Log Level:                        
  Floating I Ps:                        Disabled
  Ipip Enabled:                         false
  Log Severity Screen:                  Info
  Reporting Interval:                   0s
  Vxlan Enabled:                        false
  Wireguard Enabled:                    false
Events:                                 <none>
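(For reference, the mode can be toggled with something like this — a sketch, assuming the projectcalico.org API is reachable via kubectl as above:)

$ kubectl patch felixconfiguration default --type merge -p '{"spec":{"bpfExternalServiceMode":"Tunnel"}}'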

Kube-vip is used to allocate the external IP for LoadBalancer services and to configure that IP on a worker node.

Note that the problem happens even without kube-vip.

$ kubectl -n kube-system get pods kube-vip-tstt9-d1-co-k8sw1 -oyaml | grep "image:"
    image: ghcr.io/kube-vip/kube-vip:v0.8.0
    image: ghcr.io/kube-vip/kube-vip:v0.8.0

$ kubectl -n kube-system get pods kube-vip-cloud-provider-85fd9b9cf7-v4fbr -oyaml | grep "image:"
    image: ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.10
    image: ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.10
tomastigera commented 5 days ago

@jaxklag

When "externalTrafficPolicy: Cluster" set on service, source IP is preserved and not replaced with those of k8s node. Customer can reach external IP of the service from outside the k8s cluster.

That is not a bug, that is a feature! It makes writing policies much simpler for you!

with "externalTrafficPolicy: Cluster" set on service, source IP must be SNAT with ip of k8s node.

Nope, there is no such requirement. It is just that kube-proxy in iptables mode does not have many options to do it better and thus does it this way. That is why they had to come up with the externalTrafficPolicy mode to fix this issue; it also only selects local pods, which saves the extra hop. If you set it with the eBPF dataplane, we also save the extra hop.

jaxklag commented 5 days ago

Hello @tomastigera

Thanks for your answer. I agree with you.

What about the fact that setting externalTrafficPolicy to "Local" makes the service unreachable? Is that expected behaviour?

I understand that this config parameter is useless with Calico + the eBPF dataplane, but what about customers who still set the externalTrafficPolicy parameter and, as a consequence, make the service unreachable?

Couldn't it simply be ignored by Calico when eBPF is enabled?

Regards,

tomastigera commented 5 days ago

With Local you may get an unreachable service if your connections land on a node which does not have a backing pod. That would happen if your LB does not take this into account.
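(A quick way to check which nodes would accept Local traffic is to probe the service's healthCheckNodePort — 30726 in the service above — on each node. This assumes Calico's kube-proxy replacement serves the standard kube-proxy-style health check there; the /healthz path and response behaviour are assumptions, not verified against the eBPF dataplane:)

$ curl http://10.128.41.211:30726/healthz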

tomastigera commented 5 days ago

Seems like kube-vip might be the problem. I am not an expert on that. What does it do exactly? All I know is that it may not play exactly well with eBPF dataplanes. https://github.com/kube-vip/kube-vip/issues/594

tomastigera commented 5 days ago

@jaxklag this may also be relevant https://github.com/projectcalico/calico/issues/9141

jaxklag commented 5 days ago

@tomastigera Thanks for your replies. I will study that on Monday and come back to you with some information.

Do sysctl parameters still apply to / have an effect on network traffic handled by eBPF?

I remember that I had to set some specific sysctl config for DSR on standard Linux traffic, outside of k8s.

Currently, I haven't configured any sysctls for DSR on the ens192 interface.

Regards,

jaxklag commented 3 days ago

@tomastigera

With Local you may get an unreachable service if your connections land on a node which does not have a backing pod. That would happen if your LB does not take this into account.

Scaling the replicas so that a pod runs on the k8s node named "k8sw1" (which holds the LoadBalancer IP) makes it work with externalTrafficPolicy set to "Local".
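(A sketch of how that can be done — not necessarily exactly what I ran; the nodeSelector variant is just a hypothetical way to force the pods onto the node holding the LB IP:)

$ kubectl -n jax scale deployment nginx-jax --replicas=3

or, forcing the pods onto that node explicitly:

$ kubectl -n jax patch deployment nginx-jax -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"tstt9-d1-co-k8sw1"}}}}}'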

You were right.

On my side, there was a misunderstanding about the use case of the "Local" value versus "Cluster" with the eBPF dataplane.

With eBPF, as you explained previously, "Cluster" does the job, preserving the source IP and load-balancing across all the pod replicas even when there is no pod on the node with the LoadBalancer IP.

"Local" prevents load-balancing across all the replicas of the pod, since only the replicas present on the node carrying the LoadBalancer service IP will receive traffic. Is calico-node responsible for that decision? Why this choice with the eBPF backend?

Regards,

jaxklag commented 3 days ago

Seems like kube-vip might be the problem. I am not an expert on that. What does it do exactly? All I know is that it may not play exactly well with eBPF dataplanes. kube-vip/kube-vip#594

  • I'd turn off the DSR mode to start with, so as not to introduce more variables into this.
  • It would be good to see a tcpdump of traffic within the host that has the pod.
  • In one of the tcpdumps you provided it seems like you are making a connection from a pod. Note that externalTrafficPolicy does not apply to local traffic. There is an internalTrafficPolicy option as well.
  • It would be good to see the routing tables within that node.
  • Eventually we may want to see debug logs from ebpf (BPFLogLevel=Debug).

I don't think so, because even without kube-vip I had the same results. See my previous answer, which shows that the problem was that there was no pod replica running on the worker node with the LoadBalancer IP.

Regards

tomastigera commented 1 day ago

"Local" prevents load-balancing on all the replicas of the pod since only the replica present on the node carrying the IP of the LoadBlancer service will receive traffic. Is Calico-node responsible of that decision ? Why this choice with ebpf backend ?

That is how it is specified by k8s and its kube-proxy. We just follow the same behaviour.