kubeovn / kube-ovn

A Bridge between SDN and Cloud Native (Project under CNCF)
https://kubeovn.github.io/docs/stable/en/
Apache License 2.0

[BUG] Hairpin traffic is not working for VMs and `externalTrafficPolicy: Local` #4457

Closed kvaps closed 1 month ago

kvaps commented 1 month ago

Kube-OVN Version

v1.12.19

Kubernetes Version

v1.30.0

Operation-system/Kernel Version

OS-IMAGE         KERNEL-VERSION
Talos (v1.7.1)   6.6.29-talos  

Description

Hi, I use kube-ovn together with cilium in chaining mode, with kube-proxy replacement enabled.
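
For reference, a chaining setup like this is usually installed with cilium helm values along these lines (an illustrative sketch, not my exact configuration):

# Sketch of the cilium side of a kube-ovn chaining setup (illustrative values only):
helm install cilium cilium/cilium -n kube-system \
  --set cni.chainingMode=generic-veth \
  --set cni.customConf=true \
  --set kubeProxyReplacement=true \
  --set routingMode=native \
  --set enableIPv4Masquerade=false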

I have three nodes: srv1, srv2, srv3

I have two nginx-ingress-controller pods running on srv1 and srv2

I have a LoadBalancer service for nginx-ingress-controller with externalTrafficPolicy: Local. This service has two IPs:

NAME                       TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
nginx-ingress-controller   LoadBalancer   10.96.181.254   10.2.0.241    80:32668/TCP,443:31847/TCP   34d
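
The policy can be confirmed directly on the Service (sketch):

# Sketch: confirm the traffic policy on the Service
kubectl get svc nginx-ingress-controller \
  -o jsonpath='{.spec.externalTrafficPolicy}{"\n"}'
# expected output: Local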

I have VM running on srv3, and it can't access this service via 10.2.0.241:

ubuntu@testvm:~$ curl 10.2.0.241
curl: (7) Failed to connect to 10.2.0.241 port 80 after 0 ms: Connection refused

tcpdump shows that the node responds with TCP port 80 unreachable:

12:59:26.628915 IP 10.244.1.149.41128 > 10.2.0.241.80: Flags [S], seq 1411240999, win 65280, options [mss 1360,sackOK,TS val 545481049 ecr 0,nop,wscale 7], length 0
12:59:26.629031 IP 10.2.0.241 > 10.244.1.149: ICMP 10.2.0.241 tcp port 80 unreachable, length 72

I can see the packets on the veth interface of the pod running the virtual machine and on the ovn0 interface, but I don't see them on the external interface of the node. The srv2 node, which actually serves this IP address, doesn't receive these packets either.

If I curl from the host networking namespace of srv3, everything works fine, but the packet is routed through the external interface of the node:

/ # ip route get 10.2.0.241
10.2.0.241 dev bond0  src 10.2.0.213

Steps To Reproduce

  1. create three node cluster

  2. install kubeovn, kubevirt and metallb

  3. cordon node3

  4. install ingress-nginx-controller with --set controller.service.externalTrafficPolicy=Local (a helm sketch of this step follows the list)

  5. uncordon node3

  6. cordon node1 and node2

  7. Create VM:

    apiVersion: kubevirt.io/v1
    kind: VirtualMachine
    metadata:
     name: testvm
    spec:
     running: true
     template:
       spec:
         domain:
           devices:
             disks:
             - disk:
                 bus: virtio
               name: containerdisk
             - disk:
                 bus: virtio
               name: cloudinitdisk
             interfaces:
             - name: default
               bridge: {}
           resources:
             requests:
               memory: 1024M
         terminationGracePeriodSeconds: 0
         accessCredentials:
         volumes:
         - containerDisk:
             image: ghcr.io/aenix-io/cozystack/ubuntu-container-disk:v1.30.1@sha256:81caf89efe252ae2ca1990d08a3a314552d70ff36bcd4022b173c7150fbec805
           name: containerdisk
         - cloudInitNoCloud:
             userData: |-
               #cloud-config
               password: ubuntu
               chpasswd: { expire: False }
           name: cloudinitdisk
         networks:
         - name: default
           pod: {}
  8. Log in to this VM:

    virtctl console testvm

    login: ubuntu password: ubuntu

  9. curl the nginx-ingress service via its external IP
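
A helm sketch of step 4 (the repo URL and release name are the usual defaults, adjust as needed):

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx \
  -n ingress-nginx --create-namespace \
  --set controller.service.externalTrafficPolicy=Local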

Current Behavior

ubuntu@testvm:~$ curl  10.2.0.241
curl: (7) Failed to connect to 10.2.0.241 port 80 after 0 ms: Connection refused

Expected Behavior

ubuntu@testvm:~$ curl  10.2.0.241
default backend - 404
kvaps commented 1 month ago

Not sure if it is of any use, but I just tried to run a trace:

ovn-trace --no-friendly-names --ovs 79ddba83-8e94-4b7e-ae2c-e15e88323208 'inport == "kubernetes-infra-md0-jvsx2-mf45n.tenant-root" && ip4.dst == 10.2.0.241'

Got this output:

# ip,reg14=0x49,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=0.0.0.0,nw_dst=10.2.0.241,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=0,nw_frag=no

ingress(dp="79ddba83-8e94-4b7e-ae2c-e15e88323208", inport="kubernetes-infra-md0-jvsx2-mf45n.tenant-root")
---------------------------------------------------------------------------------------------------------
 0. ls_in_check_port_sec (northd.c:8849): 1, priority 50, uuid 91c56424
    cookie=0x91c56424, duration=9483.548s, table=8, n_packets=16415646, n_bytes=7207673331, idle_age=0, priority=50,metadata=0x3 actions=set_field:0/0x1000->reg10,resubmit(,73),move:NXM_NX_REG10[12]->NXM_NX_XXREG0[111],resubmit(,9)
    cookie=0x91c56424, duration=9483.548s, table=8, n_packets=5270831, n_bytes=3628623976, idle_age=0, priority=50,metadata=0x2 actions=set_field:0/0x1000->reg10,resubmit(,73),move:NXM_NX_REG10[12]->NXM_NX_XXREG0[111],resubmit(,9)
    reg0[15] = check_in_port_sec();
    next;
 5. ls_in_pre_lb (northd.c:6182): ip, priority 100, uuid ac5a5ecb
    cookie=0xac5a5ecb, duration=9483.515s, table=13, n_packets=0, n_bytes=0, idle_age=9483, priority=100,ipv6,metadata=0x3 actions=set_field:0x4000000000000000000000000/0x4000000000000000000000000->xxreg0,resubmit(,14)
    cookie=0xac5a5ecb, duration=9483.515s, table=13, n_packets=2692577, n_bytes=470913908, idle_age=0, priority=100,ip,metadata=0x3 actions=set_field:0x4000000000000000000000000/0x4000000000000000000000000->xxreg0,resubmit(,14)
    reg0[2] = 1;
    next;
 6. ls_in_pre_stateful (northd.c:6309): reg0[2] == 1, priority 110, uuid caba6a0d
    cookie=0xcaba6a0d, duration=9483.550s, table=14, n_packets=0, n_bytes=0, idle_age=9483, priority=110,ipv6,reg0=0x4/0x4,metadata=0x3 actions=ct(table=15,zone=NXM_NX_REG13[0..15],nat)
    cookie=0xcaba6a0d, duration=9483.550s, table=14, n_packets=0, n_bytes=0, idle_age=9483, priority=110,ip,reg0=0x4/0x4,metadata=0x2 actions=ct(table=15,zone=NXM_NX_REG13[0..15],nat)
    cookie=0xcaba6a0d, duration=9483.550s, table=14, n_packets=0, n_bytes=0, idle_age=9483, priority=110,ipv6,reg0=0x4/0x4,metadata=0x2 actions=ct(table=15,zone=NXM_NX_REG13[0..15],nat)
    cookie=0xcaba6a0d, duration=9483.550s, table=14, n_packets=2607440, n_bytes=460066484, idle_age=0, priority=110,ip,reg0=0x4/0x4,metadata=0x3 actions=ct(table=15,zone=NXM_NX_REG13[0..15],nat)
    ct_lb_mark;

ct_lb_mark /* default (use --ct to customize) */
------------------------------------------------
 7. ls_in_acl_hint (northd.c:6405): !ct.new && ct.est && !ct.rpl && ct_mark.blocked == 0, priority 4, uuid 82c8981d
    cookie=0x82c8981d, duration=9483.523s, table=15, n_packets=2047082, n_bytes=357083384, idle_age=0, priority=4,ct_state=-new+est-rpl+trk,ct_mark=0/0x1,metadata=0x3 actions=set_field:0x100000000000000000000000000/0x100000000000000000000000000->xxreg0,set_field:0x400000000000000000000000000/0x400000000000000000000000000->xxreg0,resubmit(,16)
    reg0[8] = 1;
    reg0[10] = 1;
    next;
14. ls_in_after_lb (northd.c:7863): ip4.dst != {10.244.0.0/16}, priority 60, uuid 5f938f2c
    cookie=0x5f938f2c, duration=9483.516s, table=22, n_packets=120, n_bytes=16720, idle_age=9179, priority=60,ip,metadata=0x3,nw_dst=0.0.0.0/0.16.0.0 actions=resubmit(,23)
    cookie=0x5f938f2c, duration=9483.516s, table=22, n_packets=1962405, n_bytes=407161675, idle_age=0, priority=60,ip,metadata=0x3,nw_dst=0.0.0.0/0.32.0.0 actions=resubmit(,23)
    cookie=0x5f938f2c, duration=9483.516s, table=22, n_packets=0, n_bytes=0, idle_age=9483, priority=60,ip,metadata=0x3,nw_dst=0.0.0.0/0.64.0.0 actions=resubmit(,23)
    cookie=0x5f938f2c, duration=9483.516s, table=22, n_packets=0, n_bytes=0, idle_age=9483, priority=60,ip,metadata=0x3,nw_dst=16.0.0.0/16.0.0.0 actions=resubmit(,23)
    cookie=0x5f938f2c, duration=9483.516s, table=22, n_packets=0, n_bytes=0, idle_age=9483, priority=60,ip,metadata=0x3,nw_dst=32.0.0.0/32.0.0.0 actions=resubmit(,23)
    cookie=0x5f938f2c, duration=9483.516s, table=22, n_packets=0, n_bytes=0, idle_age=9483, priority=60,ip,metadata=0x3,nw_dst=4.0.0.0/4.0.0.0 actions=resubmit(,23)
    cookie=0x5f938f2c, duration=9483.516s, table=22, n_packets=0, n_bytes=0, idle_age=9483, priority=60,ip,metadata=0x3,nw_dst=1.0.0.0/1.0.0.0 actions=resubmit(,23)
    cookie=0x5f938f2c, duration=9483.516s, table=22, n_packets=0, n_bytes=0, idle_age=9483, priority=60,ip,metadata=0x3,nw_dst=0.2.0.0/0.2.0.0 actions=resubmit(,23)
    cookie=0x5f938f2c, duration=9483.516s, table=22, n_packets=0, n_bytes=0, idle_age=9483, priority=60,ip,metadata=0x3,nw_dst=0.0.0.0/0.128.0.0 actions=resubmit(,23)
    cookie=0x5f938f2c, duration=9483.516s, table=22, n_packets=378, n_bytes=42225, idle_age=104, priority=60,ip,metadata=0x3,nw_dst=0.8.0.0/0.8.0.0 actions=resubmit(,23)
    cookie=0x5f938f2c, duration=9483.516s, table=22, n_packets=0, n_bytes=0, idle_age=9483, priority=60,ip,metadata=0x3,nw_dst=0.0.0.0/0.4.0.0 actions=resubmit(,23)
    cookie=0x5f938f2c, duration=9483.516s, table=22, n_packets=0, n_bytes=0, idle_age=9483, priority=60,ip,metadata=0x3,nw_dst=0.0.0.0/8.0.0.0 actions=resubmit(,23)
    cookie=0x5f938f2c, duration=9483.516s, table=22, n_packets=2994, n_bytes=273199, idle_age=7999, priority=60,ip,metadata=0x3,nw_dst=0.1.0.0/0.1.0.0 actions=resubmit(,23)
    cookie=0x5f938f2c, duration=9483.516s, table=22, n_packets=0, n_bytes=0, idle_age=9483, priority=60,ip,metadata=0x3,nw_dst=64.0.0.0/64.0.0.0 actions=resubmit(,23)
    cookie=0x5f938f2c, duration=9483.516s, table=22, n_packets=0, n_bytes=0, idle_age=9483, priority=60,ip,metadata=0x3,nw_dst=0.0.0.0/2.0.0.0 actions=resubmit(,23)
    cookie=0x5f938f2c, duration=9483.516s, table=22, n_packets=641544, n_bytes=52571872, idle_age=0, priority=60,ip,metadata=0x3,nw_dst=128.0.0.0/1 actions=resubmit(,23)
    next;
15. ls_in_pre_hairpin (northd.c:8008): ip && ct.trk, priority 100, uuid c90f7c67
    cookie=0xc90f7c67, duration=9483.523s, table=23, n_packets=2692578, n_bytes=470913115, idle_age=0, priority=100,ct_state=+trk,ip,metadata=0x3 actions=set_field:0/0x80->reg10,resubmit(,68),move:NXM_NX_REG10[7]->NXM_NX_XXREG0[102],set_field:0/0x80->reg10,resubmit(,69),move:NXM_NX_REG10[7]->NXM_NX_XXREG0[108],resubmit(,24)
    cookie=0xc90f7c67, duration=9483.523s, table=23, n_packets=0, n_bytes=0, idle_age=9483, priority=100,ct_state=+trk,ipv6,metadata=0x3 actions=set_field:0/0x80->reg10,resubmit(,68),move:NXM_NX_REG10[7]->NXM_NX_XXREG0[102],set_field:0/0x80->reg10,resubmit(,69),move:NXM_NX_REG10[7]->NXM_NX_XXREG0[108],resubmit(,24)
    reg0[6] = chk_lb_hairpin();
    reg0[12] = chk_lb_hairpin_reply();
    next;
26. ls_in_l2_lkup (northd.c:8782): 1, priority 0, uuid 110156bf
    cookie=0x110156bf, duration=9483.551s, table=34, n_packets=38, n_bytes=20012, idle_age=8583, priority=0,metadata=0x3 actions=set_field:0->reg15,resubmit(,71),resubmit(,35)
    cookie=0x110156bf, duration=9483.551s, table=34, n_packets=0, n_bytes=0, idle_age=9483, priority=0,metadata=0x2 actions=set_field:0->reg15,resubmit(,71),resubmit(,35)
    outport = get_fdb(eth.dst);
    next;
27. ls_in_l2_unknown (northd.c:8794): outport == "none", priority 50, uuid d1868fe7
    cookie=0xd1868fe7, duration=9483.553s, table=35, n_packets=38, n_bytes=20012, idle_age=8583, priority=50,reg15=0,metadata=0x3 actions=drop
    cookie=0xd1868fe7, duration=9483.553s, table=35, n_packets=0, n_bytes=0, idle_age=9483, priority=50,reg15=0,metadata=0x2 actions=drop
    drop;
kvaps commented 1 month ago

In ovn-sbctl dump-flows, I see my cluster IP but not the external IP; maybe this is the issue?

root@srv2:/kube-ovn# ovn-sbctl dump-flows | grep 10.2.0.241
root@srv2:/kube-ovn# ovn-sbctl dump-flows | grep 10.96.181.254
  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[2] == 1 && ip4.dst == 10.96.181.254 && tcp.dst == 443), action=(reg1 = 10.96.181.254; reg2[0..15] = 443; ct_lb_mark;)
  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[2] == 1 && ip4.dst == 10.96.181.254 && tcp.dst == 80), action=(reg1 = 10.96.181.254; reg2[0..15] = 80; ct_lb_mark;)
  table=12(ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.96.181.254 && tcp.dst == 443), action=(reg0[1] = 0; ct_lb_mark(backends=10.244.0.23:443,10.244.0.2:443);)
  table=12(ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.96.181.254 && tcp.dst == 80), action=(reg0[1] = 0; ct_lb_mark(backends=10.244.0.23:80,10.244.0.2:80);)
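
The OVN NB side can be checked in a similar way to see whether the external IP is configured as a VIP at all (sketch; run inside the ovn-central pod or via the kubectl-ko plugin):

# Sketch: look for the cluster IP and the external IP among the OVN load balancer VIPs
ovn-nbctl lb-list | grep -E '10\.96\.181\.254|10\.2\.0\.241'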
kvaps commented 1 month ago

@zhangzujian could you please point me to the right place in the code where these rules for the cluster IP get created? I can invest my time in adding external-IP support there as well.

zhangzujian commented 1 month ago

Kube-OVN adds an iptables rule for such traffic. But with cilium chaining, it seems the rule does not work.
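
A quick way to check whether that rule is present and actually hit on the node is something like this (sketch; the exact chain names depend on the Kube-OVN version):

# Sketch: look for NAT rules referencing the external IP and inspect their packet counters
iptables -t nat -S | grep 10.2.0.241
iptables -t nat -L -n -v | grep 10.2.0.241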

kvaps commented 1 month ago

Kube-OVN adds an iptables rule for such traffic

I guess the packets do not reach the host networking namespace: I can only see them on the ovn0 interface, so the iptables rules added in the host networking namespace don't take effect either.

Wouldn't it be better to add a flow for the external IP, the same way as for the cluster IP? AFAIK Cilium implements the same logic for external IPs.
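
To test that idea manually, the external IP could be added as an extra VIP on the existing OVN load balancer, roughly like this (sketch; the load-balancer name and backends below are placeholders):

# Sketch: add the external IP as a VIP on the existing OVN LB (placeholder names)
ovn-nbctl lb-add <cluster-tcp-loadbalancer> 10.2.0.241:80 10.244.0.23:80,10.244.0.2:80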

zhangzujian commented 1 month ago

I cannot reproduce the problem in my KIND cluster with kube-ovn v1.12.22 (or v1.13.0), cilium v1.16.1 and metallb v0.14.8.

zhangzujian commented 1 month ago

Could you please capture traffic on srv3 ovn0 and bond0 to see where the ICMP port unreachable packet comes from?
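
For example (sketch):

# Sketch: capture the SYN and the ICMP reject on both interfaces of srv3
tcpdump -ni ovn0  'host 10.2.0.241 and (tcp port 80 or icmp)'
tcpdump -ni bond0 'host 10.2.0.241 and (tcp port 80 or icmp)'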

kvaps commented 1 month ago

Did you test a VM or a pod? It works fine for me with ordinary pods, but not with VMs.

zhangzujian commented 1 month ago

Did you test a VM or a pod? It works fine for me with ordinary pods, but not with VMs.

Both Pod and VM work fine.

kvaps commented 1 month ago

Hey. I just compared the configuration provided in the Makefile with the configuration that we use, and found that this option makes the difference: https://github.com/kubeovn/kube-ovn/blob/ee6e590d0b34c7c8ba6eeb6cd3416d6f98fbb8f4/Makefile#L832

When it is enabled, externalTrafficPolicy: Local works fine.

Sorry for taking your time.

BTW, my forceDeviceDetection feature (https://github.com/cilium/cilium/pull/32730) has been successfully merged into cilium, so now it is possible to use --set devices=ovn0 --set forceDeviceDetection=true to add the ovn0 device explicitly while still keeping automatic device detection enabled.
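
For example (sketch; cilium 1.16+ is assumed, since that is where the PR landed):

# Sketch: add ovn0 explicitly while keeping automatic device detection (cilium 1.16+)
helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
  --set devices=ovn0 \
  --set forceDeviceDetection=true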