ovn-org / ovn-kubernetes

A robust Kubernetes networking platform
https://ovn-kubernetes.io/
Apache License 2.0

Cannot access K8s service cluster IP from the K8s master #758

Open girishmg opened 5 years ago

girishmg commented 5 years ago

Note: See https://docs.google.com/document/d/1anbU730-qMsdaBCyZxMDMtQe2SWoGOY56cs2nhzDmJ8/edit?usp=sharing for more information on the issue

How to Reproduce:

On the K8s master host run

nc -zv 10.96.0.1 443

and the command will time out.

(Note: make sure you don't have any iptables rules installed by kube-proxy. Run: iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X)
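Before running the test, it may help to confirm that the kube-proxy rules are really gone; a small check (a sketch, requires root):

```shell
# List any remaining kube-proxy chains or rules; after the flush above
# this should print "no kube-proxy rules" (a sketch; requires root).
iptables-save | grep -i kube || echo "no kube-proxy rules"
```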

Software Version: OVS 2.11.1 and OVN 2.11.1 and Linux Upstream Kernel 4.15 (I have tried Linux OVS Tree Kernel as well using dkms).

Observation:

We are seeing an issue on the K8s master where a HostNetwork pod tries to access the K8s API server using the kubernetes service cluster IP 10.96.0.1. The HostNetwork pod connects to 10.96.0.1:443 and the request times out.

Because of the following route in the host IP stack

$ ip ro |grep 10.96.0.0/12
10.96.0.0/12 via 192.168.0.1 dev 192.168.0.2

the packet from the HostNetwork pod enters the OVN logical network through the K8s management port. There it hits the load balancer rule and its destination is translated to 10.0.2.16:6443:

# ovn-nbctl lb-list
UUID                                    LB                  PROTO      VIP                IPs
03b2cf30-9b12-4538-926b-01a0225a5290                        udp        10.96.0.10:53      192.168.0.3:53,192.168.1.3:53
a482de69-af51-4ae2-a60c-81b32120e1f9                        tcp        10.96.0.10:53      192.168.0.3:53,192.168.1.3:53
                                                            tcp        10.96.0.10:9153    192.168.0.3:9153,192.168.1.3:9153
                                                            tcp        10.96.0.1:443      10.0.2.16:6443

The packet then hits ovn_cluster_router and matches the following route

$ ovn-nbctl lr-route-list ovn_cluster_router
IPv4 Routes
                10.0.2.16               192.168.0.2 dst-ip

and the packet is sent back to the management port. The corresponding ct() action in OVS then fails to execute due to 'Invalid argument':

2019-06-26T23:37:10.257Z|00653|dpif(handler50)|WARN|system@ovs-system: execute ct(commit,zone=1,label=0/0x1),3 failed (Invalid argument) on packet tcp,vlan_tci=0x0000,dl_src=00:00:00:0a:d0:10,dl_dst=9e:a2:fc:79:97:f7,nw_src=192.168.0.2,nw
_dst=10.0.2.16,nw_tos=0,nw_ecn=0,nw_ttl=63,tp_src=9998,tp_dst=6443,tcp_flags=syn tcp_csum:b856
with metadata skb_priority(0),skb_mark(0),ct_state(0x21),ct_zone(0x1),ct_tuple4(src=192.168.0.2,dst=10.0.2.16,proto=6,tp_src=9998,tp_dst=6443),in_port(3) mtu 0

The failure is in the ovs_ct_execute() function in net/openvswitch/conntrack.c.
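The duplicate conntrack commit can be inspected directly on the node; a sketch, assuming an OVS version whose dpctl/dump-conntrack accepts a zone filter (zone 1 matches the ct_zone in the warning above) and that conntrack-tools is installed:

```shell
# Dump datapath conntrack entries for zone 1, the zone named in the
# "Invalid argument" warning (a sketch; ovs-vswitchd must be running).
ovs-appctl dpctl/dump-conntrack zone=1

# Cross-check against the kernel's view of the translated 5-tuple
# (a sketch; requires conntrack-tools and root).
conntrack -L -p tcp --dport 6443 | grep 10.0.2.16
```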

girishmg commented 5 years ago

More information on this issue:

Issue 1: Hairpin traffic on the management port

Logical topology

                +--------------------+
       mport    | ovn_cluster_router |
          |     +--+-----------------+
          |        |
         ++--------+-+
         |  node1    |
         +-----------+

Physical topology

         Host
         +---------------------+
         |   (192.168.1.2)     |
         |      mport          |
         |        |            |
         |      +-+----------+ |
         |      |   br-int   | |
         |      +------------+ |
         | 10.0.2.16           |
         +---------------------+
  1. Build logical topology

(NOTE: replace 10.0.2.16 with your test machine's IP)

ovn-nbctl ls-add node1 -- set logical_switch node1 \
    other-config:subnet=192.168.1.0/24
ovn-nbctl lsp-add node1 port1 -- lsp-set-addresses \
    port1 "0:1:2:3:4:2 192.168.1.2"
ovn-nbctl lr-add ovn_cluster_router
ovn-nbctl lr-route-add ovn_cluster_router 10.0.2.16 192.168.1.2
ovn-nbctl lrp-add ovn_cluster_router rtos-node1 00:00:00:CB:5A:76 192.168.1.1/24
ovn-nbctl lsp-add node1 stor-node1 -- set logical_switch_port stor-node1 \
    type=router options:router-port=rtos-node1 addresses=\"00:00:00:CB:5A:76\"
uuid=`ovn-nbctl create load_balancer protocol=tcp`
ovn-nbctl set load_balancer $uuid vips:\"10.96.0.1:3333\"=\"10.0.2.16:80\"
ovn-nbctl ls-lb-add node1 $uuid
ovn-nbctl acl-add node1 to-lport 1001 ip4.src==192.168.1.2 allow-related
  2. Physical binding

    ovs-vsctl add-port br-int k8s-mport -- set interface k8s-mport type=internal \
      external_ids:iface-id=port1
    ip li set k8s-mport address 0:1:2:3:4:2
    ip a add dev k8s-mport 192.168.1.2/24
    ip li set k8s-mport up
    ip ro add 10.96.0.0/12 via 192.168.1.2
  3. On the host run the HTTP server

    python -m SimpleHTTPServer 80 &
  4. On the host disable rp_filter and enable accept_local on the management port

    sysctl -w net.ipv4.conf.k8s-mport.rp_filter=0
    sysctl -w net.ipv4.conf.k8s-mport.accept_local=1
  5. On one terminal, tail ovs-vswitchd.log

    term1$ tail -f /var/log/openvswitch/ovs-vswitchd.log
  6. On a second terminal, run tcpdump

    term2$ tcpdump -enni k8s-mport tcp port 9991
  7. On a third terminal, generate the hairpin traffic

    term3$ nc -p 9991 -zv 10.96.0.1 3333

The packet in the host (192.168.1.2:9991 → 10.96.0.1:3333) enters the OVN pipeline and the LB DNATs it to 10.0.2.16:80. The packet then hits ovn_cluster_router, matches the static route, and is forwarded back out k8s-mport. At this point the ACL on node1 (to-lport, ip4.src == 192.168.1.2) matches and results in an 'Invalid argument' error in ovs-vswitchd.log:
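The hairpin path can also be replayed in the logical pipeline with ovn-trace; a sketch using the names from the repro above (eth.dst is rtos-node1's MAC, and --ct=new simulates a fresh connection through the load balancer):

```shell
# Trace the client SYN from port1 through the node1 switch: the LB
# should DNAT 10.96.0.1:3333 -> 10.0.2.16:80, and the static route on
# ovn_cluster_router should send the packet back out the same port
# (a sketch; run on a host with access to the OVN southbound DB).
ovn-trace --ct=new node1 '
    inport == "port1" &&
    eth.src == 00:01:02:03:04:02 && eth.dst == 00:00:00:cb:5a:76 &&
    ip4.src == 192.168.1.2 && ip4.dst == 10.96.0.1 && ip.ttl == 64 &&
    tcp && tcp.src == 9991 && tcp.dst == 3333'
```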

    2019-09-25T06:07:22.959Z|00001|dpif(handler1)|WARN|system@ovs-system: execute ct(commit,zone=1,label=0/0x1),1 failed (Invalid argument) on packet tcp,vlan_tci=0x0000,dl_src=00:00:00:cb:5a:76,dl_dst=00:01:02:03:04:02,nw_src=192.168.1.2,nw_dst=10.10.0.11,nw_tos=0,nw_ecn=0,nw_ttl=63,tp_src=9991,tp_dst=80,tcp_flags=syn tcp_csum:6273
    with metadata skb_priority(0),skb_mark(0),ct_state(0x21),ct_zone(0x1),ct_tuple4(src=192.168.1.2,dst=10.10.0.11,proto=6,tp_src=9991,tp_dst=80),in_port(1) mtu 0

Now, if I remove the ACL (to-lport ip4.src==192.168.1.2), the warning goes away. So clearly the CT zone ID of the logical_switch_port is being used twice to commit connection state.

Suppose we fix the above issue in OVN; we then run into a different problem, explained below. (I removed the ACL just to check whether things would work, but they didn't, and I hit the issue below.)

Issue 2: TCP client state machine sends Reset

|------------------+------------------+------------------+-----------------------------|
| Location         |        SRC:SPort |        DST:DPort | Notes                       |
|------------------+------------------+------------------+-----------------------------|
| Client (Host)    | 192.168.1.2:9991 |   10.96.0.1:3333 |                             |
| OVN pipeline     | 192.168.1.2:9991 |     10.0.2.16:80 | LB DNAT on `node1` logical  |
|                  |                  |                  | switch                      |
| ---------------   packet hairpins and comes back to the host ----------------------- |
|                  |                  |                  |                             |
| Client (Host)    | 192.168.1.2:9991 |     10.0.2.16:80 | The Web server receives the |
|                  |                  |                  | SYN packet                  |
|                  |                  |                  |                             |
| WebServer (Host) |     10.0.2.16:80 | 192.168.1.2:9991 | Server sends SYN+ACK to     |
|                  |                  |                  | 192.168.1.2:9991 instead of |
|                  |                  |                  | 10.96.0.1:3333              |
|                  |                  |                  |                             |
| Kernel (Host)    | Sends a Reset since SYN+ACK is from a different endpoint          |
|------------------+------------------+------------------+-----------------------------|
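The tuple mismatch in the table can be observed on the host while the nc command is running; a sketch (interface and port numbers from the repro above):

```shell
# The client socket is keyed on the VIP tuple, so it sits in SYN-SENT
# toward 10.96.0.1:3333 (a sketch; run while nc is in flight).
ss -tan 'sport = :9991'

# Meanwhile the wire shows the SYN+ACK arriving from the backend tuple
# 10.0.2.16:80, which matches no socket, so the kernel answers with RST.
tcpdump -nni k8s-mport -c 4 'tcp port 9991'
```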

So even if we fix OVN to use separate CT zones for hairpin traffic, we are still left with the reset issue above. I tried policy-based routing with ip rule, iptables connection marking, SNAT, and so on, but nothing worked.
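One OVN-side direction that avoids the asymmetric reply is to have the load balancer SNAT hairpinned traffic so the backend's reply returns through the router and is un-DNATed before reaching the client. A sketch, assuming the logical_router option lb_force_snat_ip is available in the OVN version in use and applies to router-attached load balancers (names from the repro above; $uuid is the load balancer created earlier):

```shell
# Attach the load balancer to the router as well, then force SNAT of
# load-balanced traffic to the router port IP, so the web server sees
# 192.168.1.1 as the source and replies back through the router.
# (A sketch; this changes behavior for all LB traffic on this router.)
ovn-nbctl lr-lb-add ovn_cluster_router $uuid
ovn-nbctl set logical_router ovn_cluster_router \
    options:lb_force_snat_ip="192.168.1.1"
```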

shettyg commented 5 years ago

I think it makes sense to introduce a second gateway router for this purpose, and do it the right way.