openshift / openshift-sdn


Timeout on first connection attempt in multi-tenant mode after some time of inactivity (origin 1.1.0.1) #231

Closed (barkbay closed this issue 8 years ago)

barkbay commented 8 years ago

Hi all,

I've been stuck for several days on an issue where, after a fairly long period of inactivity, the first connection between two pods times out (subsequent connections are fine):

lenaic-node-compute-1ba08.example.com origin-node[2055]: I1220 15:08:33.591561    2055 proxysocket.go:96] Mapped service "postgresql2/postgresql:postgresql" to endpoint 10.1.1.2:5432
lenaic-node-compute-1ba08.example.com origin-node[2055]: E1220 15:08:34.591741    2055 proxysocket.go:104] Dial failed: dial tcp 10.1.1.2:5432: i/o timeout
lenaic-node-compute-1ba08.example.com origin-node[2055]: I1220 15:08:34.591788    2055 roundrobin.go:130] NextEndpoint for service "postgresql2/postgresql:postgresql", srcAddr=10.1.2.2:37328: endpoints: [10.1.1.2:5432]
lenaic-node-compute-1ba08.example.com origin-node[2055]: I1220 15:08:34.591806    2055 proxysocket.go:96] Mapped service "postgresql2/postgresql:postgresql" to endpoint 10.1.1.2:5432
lenaic-node-compute-1ba08.example.com origin-node[2055]: E1220 15:08:36.592010    2055 proxysocket.go:104] Dial failed: dial tcp 10.1.1.2:5432: i/o timeout
lenaic-node-compute-1ba08.example.com origin-node[2055]: E1220 15:08:36.592043    2055 proxysocket.go:138] Failed to connect to balancer: failed to connect to an endpoint.

It seems to happen if no network frame has been transmitted for a period equal to the "hard_timeout" value in the Open vSwitch learn table (table 8). I decreased this value from 900 seconds to 60 seconds and I can consistently reproduce the issue if I wait 60 seconds between connection attempts.
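
For anyone who wants to look at the same thing, the learned entries and their hard_timeout can be inspected on a node with ovs-ofctl; a minimal sketch, assuming the SDN OVS bridge is named br0 (the bridge name is not stated in this thread):

# Dump the learned MAC -> tunnel-destination flows; each entry shows the hard_timeout
# it was created with and how long it has been installed (duration):
ovs-ofctl dump-flows br0 table=8

# Re-running this every few seconds shows the entries disappearing once hard_timeout elapses:
watch -n 5 "ovs-ofctl dump-flows br0 table=8"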

Here is a small diagram of my setup: https://www.dropbox.com/s/esu2gtiz8y6js1x/use_case.png?dl=1 and here is the output of the debug.sh script: https://www.dropbox.com/s/k8x4i2308wpkqlu/openshift-sdn-debug-2015-12-20.tgz?dl=1

I would say it is related to an ARP issue; here is what I see when I run tcpdump on the VXLAN connection (client side):

[root@lenaic-node-compute-1ba08 openshift]# tcpdump -i eth0 -nn -v udp port 4789 
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes 
14:11:22.222814 IP (tos 0x0, ttl 64, id 1049, offset 0, flags [DF], proto UDP (17), length 110) 
   192.168.55.233.36826 > 192.168.55.175.4789: VXLAN, flags [I] (0x08), vni 0 
IP (tos 0x0, ttl 64, id 3905, offset 0, flags [DF], proto TCP (6), length 60) 
   10.1.2.1.60708 > 10.1.1.2.5432: Flags [S], cksum 0x7bfd (correct), seq 2334882139, win 14100, options [mss 1410,sackOK,TS val 18773530 ecr 0,nop,wscale 7], length 0 
                                    [... cut a lot of SYN packets ...]
14:11:24.976028 IP (tos 0x0, ttl 64, id 1053, offset 0, flags [DF], proto UDP (17), length 110) 
   192.168.55.233.35083 > 192.168.55.175.4789: VXLAN, flags [I] (0x08), vni 0 
IP (tos 0x0, ttl 64, id 3443, offset 0, flags [DF], proto TCP (6), length 60) 
   10.1.2.1.60715 > 10.1.1.2.5432: Flags [S], cksum 0xb13d (correct), seq 1392144771, win 14100, options [mss 1410,sackOK,TS val 18776284 ecr 0,nop,wscale 7], length 0 
14:11:27.236170 IP (tos 0x0, ttl 64, id 1054, offset 0, flags [DF], proto UDP (17), length 78) 
   192.168.55.233.39892 > 192.168.55.175.4789: VXLAN, flags [I] (0x08), vni 0 
ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.1.1.2 tell 10.1.2.1, length 28 
14:11:27.236741 IP (tos 0x0, ttl 64, id 17103, offset 0, flags [DF], proto UDP (17), length 78) 
   192.168.55.175.37297 > 192.168.55.233.4789: VXLAN, flags [I] (0x08), vni 0 
ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.1.1.2 is-at 02:42:0a:01:01:02, length 28

(By the way, you can see that the VNI is 0; I'm not sure whether that is expected in a multi-tenant environment.) The interesting point here is that once the ARP exchange has completed, subsequent connections succeed!
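
A quick way to confirm that dependency is to trigger the ARP exchange by hand from the client node and then retry the connection; a sketch, assuming iputils arping is installed and that tun0 is the node-side SDN interface carrying 10.1.2.1 (an assumption based on the capture above):

# Force an ARP request/reply for the destination pod across the SDN:
arping -c 1 -I tun0 10.1.1.2

# If the problem is ARP/learn-table related, a connection attempt made right
# after this should succeed without the initial timeout.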

During the connection attempts, these flows are created in the OVS flow table on the client:

cookie=0x0, duration=14.426s, table=8, n_packets=16, n_bytes=1390, hard_timeout=60, priority=200,dl_dst=02:42:0a:01:02:02 actions=load:0->NXM_NX_TUN_IPV4_DST[],output:4 
cookie=0x0, duration=14.426s, table=8, n_packets=3, n_bytes=126, hard_timeout=60, priority=200,dl_dst=02:9e:4d:7f:00:a3 actions=load:0->NXM_NX_TUN_IPV4_DST[],output:2 
cookie=0x0, duration=9.419s, table=8, n_packets=0, n_bytes=0, hard_timeout=60, priority=200,dl_dst=02:42:0a:01:01:02 actions=load:0xc0a837af->NXM_NX_TUN_IPV4_DST[],output:1

and these flows are created on the server:

cookie=0x0, duration=10.301s, table=8, n_packets=1, n_bytes=42, hard_timeout=60, priority=200,dl_dst=02:9e:4d:7f:00:a3 actions=load:0xc0a837e9->NXM_NX_TUN_IPV4_DST[],output:1 
cookie=0x0, duration=5.291s, table=8, n_packets=0, n_bytes=0, hard_timeout=60, priority=200,dl_dst=02:42:0a:01:01:02 actions=load:0->NXM_NX_TUN_IPV4_DST[],output:4

Once they have expired, the problem appears again.

As a side note, there is no issue if I use the pod IP (10.1.1.2) instead of the service IP (172.30.184.188).

Let me know if you need any other information; this problem prevents us from enabling multi-tenant mode.

Thank you !

barkbay commented 8 years ago

On the VXLAN interface the VNID is lost when the service IP (172.30.184.188) is used:

[root@lenaic-node-compute-1ba08 openshift]# tcpdump -vvv -nn -i eth0 udp|grep vni
192.168.55.233.59854 > 192.168.55.175.4789: [no cksum] VXLAN, flags [I] (0x08), vni 0
192.168.55.175.48585 > 192.168.55.233.4789: [no cksum] VXLAN, flags [I] (0x08), vni 13
192.168.55.233.59854 > 192.168.55.175.4789: [no cksum] VXLAN, flags [I] (0x08), vni 0
192.168.55.233.59854 > 192.168.55.175.4789: [no cksum] VXLAN, flags [I] (0x08), vni 0
[....]

If I use the pod IP (10.1.1.2), everything is OK:

[root@lenaic-node-compute-1ba08 openshift]# tcpdump -vvv -nn -i eth0 udp|grep vni
192.168.55.233.43037 > 192.168.55.175.4789: [no cksum] VXLAN, flags [I] (0x08), vni 13
192.168.55.175.44208 > 192.168.55.233.4789: [no cksum] VXLAN, flags [I] (0x08), vni 13
192.168.55.233.43037 > 192.168.55.175.4789: [no cksum] VXLAN, flags [I] (0x08), vni 13
[...]

I think this is the root of the problem. Any ideas?

rajatchopra commented 8 years ago

'vnid 0' is expected because the service proxy has global access. Not sure why the ARP learning takes so long that a connection timeout has to happen. /cc @danwinship

barkbay commented 8 years ago

Why do we have to wait for an ARP request? Isn't it the purpose of table 6 to send the data to the right port?

cookie=0x0, duration=3756.657s, table=6, n_packets=82, n_bytes=6511, priority=200,ip,reg0=0 actions=goto_table:8
cookie=0x0, duration=3751.984s, table=6, n_packets=0, n_bytes=0, priority=100,ip,reg0=0xd,nw_dst=10.1.1.2 actions=output:4

Port 4 is the pod I want to connect to.
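
(For reference, a way to map the OpenFlow port numbers in these rules to interface names; a sketch, assuming the bridge is br0:)

# List the ports on the bridge together with their OpenFlow port numbers and interface names:
ovs-ofctl show br0

# Alternatively, show only the port descriptions:
ovs-ofctl dump-ports-desc br0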

danwinship commented 8 years ago

Why do we have to wait for an ARP request? Isn't it the purpose of table 6 to send the data to the right port?

Yes, but the pods don't know about that; they think they are connected to an Ethernet network, so the source pod can't send any data to the destination until it knows the destination's MAC address (since it has to write that address into the packet).
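
In the capture above the ARP request actually comes from the node side (who-has 10.1.1.2 tell 10.1.2.1, i.e. the service proxy's traffic leaving via the node's SDN interface), but the same principle applies: the sender needs a neighbor entry for the destination before it can emit the SYN. A sketch of how one might inspect that state on the client node; the interpretation of the route as on-link is an assumption based on the capture:

# The route to the destination pod should point straight out the SDN interface (on-link),
# which is why a MAC address has to be resolved first:
ip route get 10.1.1.2

# The neighbor entry the kernel needs before it can transmit the first SYN:
ip neigh show 10.1.1.2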

danwinship commented 8 years ago

@barkbay Try editing /usr/bin/openshift-sdn-ovs-setup.sh on the node with the service; search for the rule that starts with "table=3" and ends with "goto_table:7", and change that to "goto_table:6" (and restart the node). Fixed?
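
A way to locate the rule in question on a running node without digging through the script; a sketch, assuming the SDN bridge is br0:

# Show the table=3 rules; the one whose actions end in goto_table:7 is the one to change:
ovs-ofctl dump-flows br0 table=3 | grep "goto_table:7"

# After editing the setup script and restarting the node, the same rule should
# end in goto_table:6 instead.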

barkbay commented 8 years ago

@danwinship Thanks for the tip. FYI, here is the fix I'm currently testing on my cluster: https://gist.github.com/barkbay/cf57b2b4022bbc2d02ba So far it also seems to solve the problem.

I'll try your suggestion and update the issue.

barkbay commented 8 years ago

/usr/bin/openshift-sdn-ovs-setup.sh does not exist in Openshift v1.1.0.1

danwinship commented 8 years ago

/usr/bin/openshift-sdn-ovs-setup.sh does not exist in Openshift v1.1.0.1

Hm... the bug fixed in #236 was introduced after openshift-sdn-kube-subnet-setup.sh and openshift-sdn-multitenant-setup.sh were merged.

Looking at your patch, it is basically bypassing the ARP cache table. So if that fixes things, that suggests that the problem is that the sender still has the MAC address cached, so it doesn't need to send an ARP, but OVS has dropped it from its cache. And looking at the rules, that definitely seems like it could happen...
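
One way to check that theory during the timeout window; a sketch, with br0 as the SDN bridge and the addresses taken from the captures earlier in this issue:

# On the client node, the kernel may still hold a neighbor entry for the destination pod...
ip neigh show 10.1.1.2

# ...while the learned flow for that pod's MAC has already aged out of the OVS learn table:
ovs-ofctl dump-flows br0 table=8 | grep "02:42:0a:01:01:02"

# A populated neighbor entry combined with no matching table=8 flow would mean the sender
# never re-ARPs, so OVS never relearns where to send the frame.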

barkbay commented 8 years ago

Looking at your patch, it is basically bypassing the ARP cache table. So if that fixes things, that suggests that the problem is that the sender still has the MAC address cached, so it doesn't need to send an ARP, but OVS has dropped it from its cache. And looking at the rules, that definitely seems like it could happen...

Congratulations :+1: Indeed, if I regularly delete the ARP entries on the hosts, the problem no longer appears. I need a battle plan. Do you think my patch is an acceptable solution? Let me know how I can help.
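
For reference, the kind of interim workaround this implies; a sketch, not taken from the thread, with tun0 and the pod IP as assumptions from the captures above:

# Drop the cached neighbor entry for the destination pod so that the next connection
# triggers a fresh ARP exchange, which repopulates the OVS learn table on both nodes:
ip neigh del 10.1.1.2 dev tun0

# Or flush everything learned on the SDN gateway interface:
ip neigh flush dev tun0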

danwinship commented 8 years ago

Your patch leaves redundant rules, so it's not quite right. Between this and #239, it looks like we'll need some heavy rewriting of the OVS rules. I'm working on it.

mmaier2 commented 8 years ago

It seems we're facing the same problem on OpenShift Enterprise 3.0.2, between an OpenShift PHP pod running WordPress and a MySQL DB pod. At times (not really reproducible) WordPress complains that it cannot connect to MySQL (service via cluster IP), and indeed any connection (curl, telnet, etc.) to the cluster IP and port of the MySQL service times out. But after a connection to the IP of the MySQL pod itself is made, the service cluster IP works again, so the MySQL server responds on the cluster IP.