openshift / openshift-sdn

Apache License 2.0

Some rules missing when running master + minion on the same node #100

Closed mrunalp closed 8 years ago

mrunalp commented 9 years ago
    [root@ose-master ~]# ovs-ofctl -O OpenFlow13 dump-flows br0
    OFPST_FLOW reply (OF1.3) (xid=0x2):
     cookie=0xc0a86403, duration=211.398s, table=0, n_packets=104403, n_bytes=4384926, priority=100,arp,arp_tpa=10.1.2.0/24 actions=set_field:192.168.100.3->tun_dst,output:1
     cookie=0xc0a86403, duration=211.401s, table=0, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.1.2.0/24 actions=set_field:192.168.100.3->tun_dst,output:1
     cookie=0xc0a86402, duration=211.409s, table=0, n_packets=1, n_bytes=102, priority=75,ip,nw_dst=10.1.1.0/24 actions=output:9
     cookie=0xc0a86402, duration=211.405s, table=0, n_packets=105165, n_bytes=4416930, priority=75,arp,arp_tpa=10.1.1.0/24 actions=output:9
     cookie=0xc0a86404, duration=211.389s, table=0, n_packets=756, n_bytes=31752, priority=100,arp,arp_tpa=10.1.0.0/24 actions=set_field:192.168.100.4->tun_dst,output:1
     cookie=0xc0a86404, duration=211.393s, table=0, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.1.0.0/24 actions=set_field:192.168.100.4->tun_dst,output:1

vs the expected

[root@ose-master ~]# ovs-ofctl -O OpenFlow13 dump-flows br0
OFPST_FLOW reply (OF1.3) (xid=0x2):
 cookie=0x0, duration=8.723s, table=0, n_packets=3, n_bytes=210, priority=50 actions=output:2
 cookie=0x0, duration=8.717s, table=0, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.1.1.1 actions=output:2
 cookie=0x0, duration=8.713s, table=0, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.1.1.1 actions=output:2
 cookie=0xc0a86403, duration=8.663s, table=0, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.1.2.0/24 actions=set_field:192.168.100.3->tun_dst,output:1
 cookie=0xc0a86403, duration=8.667s, table=0, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.1.2.0/24 actions=set_field:192.168.100.3->tun_dst,output:1
 cookie=0xc0a86402, duration=8.675s, table=0, n_packets=0, n_bytes=0, priority=75,ip,nw_dst=10.1.1.0/24 actions=output:9
 cookie=0xc0a86402, duration=8.671s, table=0, n_packets=0, n_bytes=0, priority=75,arp,arp_tpa=10.1.1.0/24 actions=output:9
 cookie=0xc0a86404, duration=8.656s, table=0, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.1.0.0/24 actions=set_field:192.168.100.4->tun_dst,output:1
 cookie=0xc0a86404, duration=8.659s, table=0, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.1.0.0/24 actions=set_field:192.168.100.4->tun_dst,output:1

rajatchopra commented 9 years ago

Race condition:
Thread1: kubelet checks for docker version during its setup
Thread2: sdn-node restarts docker daemon during its setup

A restart of the docker daemon can cause the docker version check to fail, and openshift-node will then panic and exit. When openshift-node is restarted, the SDN setup is skipped and some of the OVS rules never get reinstated.
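One way to tolerate this kind of race is to retry the readiness probe instead of failing on the first error. The sketch below is only an illustration of that idea, not what openshift-node does: `wait_until_ready` is a hypothetical helper, and probing with `docker version` is an assumption.

```shell
# Hypothetical helper: retry a readiness probe until it succeeds or the
# attempts run out.
# Usage: wait_until_ready ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until_ready() {
    local attempts=$1 delay=$2 i=0
    shift 2
    while [ "$i" -lt "$attempts" ]; do
        # Run the probe quietly; success ends the wait immediately.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example (assumption: the docker CLI is on PATH):
#   wait_until_ready 30 1 docker version || echo "docker did not come back" >&2
```

A retry only hides the symptom; ordering the two setup steps so they cannot overlap would remove the race itself.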

bendikp commented 8 years ago

I'm not sure if this is related to this issue, but I have encountered the same problem after rebooting a node that only runs origin-node. Origin-node fails on the first startup attempt, and when restarted everything seems okay. However, some OVS rules are missing, so the pods on the node cannot contact external resources.

ip route from a pod running on the node that is missing OVS-rules

bash-4.2$ ip route
default via 10.254.6.1 dev eth0 
10.254.6.0/24 dev eth0  proto kernel  scope link  src 10.254.6.14

ip route from a similar pod running on another node with all the OVS-rules

bash-4.2$ ip route
default via 10.254.5.1 dev eth0 
10.254.0.0/16 dev eth0  proto kernel  scope link  src 10.254.5.23 
10.254.5.0/24 dev eth0  proto kernel  scope link  src 10.254.5.23 

Is it possible to add an if-statement to the setup_required() function in openshift-sdn-kube-subnet-setup.sh to make sure all the OVS rules are configured?

Something like this:

if ! ovs-ofctl -O OpenFlow13 dump-flows br0 | grep -q "output:2"; then
        return 0
fi

danwinship commented 8 years ago

> Is it possible to add an if-statement to the setup_required() function in openshift-sdn-kube-subnet-setup.sh to make sure all the OVS rules are configured?

That exists in current git master and has been Godep'ed over to origin, and I'm pretty sure it's in the latest ose builds now too.

bendikp commented 8 years ago

Is this the commit you are referring to? https://github.com/openshift/openshift-sdn/commit/b79a746ca73d94dd2acecfd3e1365b964e22e8fd

As far as I can tell, this only checks that there are some OVS rules configured, not the specific ones that go missing after the race condition has occurred.
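A check for the specific rules could look roughly like the sketch below. It assumes the three `output:2` rules from the "expected" dump at the top of this issue are the ones that matter, and that the address in the `arp_tpa`/`nw_dst` rules is the node's subnet gateway. `tun_rules_missing` is a hypothetical name, and the dump is read from stdin so the function can be tried without OVS.

```shell
# Hypothetical check: return 0 when any of the three output:2 rules from the
# "expected" dump is absent for the given gateway IP, 1 when all are present.
# The output of "ovs-ofctl -O OpenFlow13 dump-flows br0" is read from stdin.
tun_rules_missing() {
    local gateway=$1 flows rule
    flows=$(cat)
    for rule in \
        "priority=50 actions=output:2" \
        "arp,arp_tpa=${gateway} actions=output:2" \
        "ip,nw_dst=${gateway} actions=output:2"; do
        # Fixed-string match, so dots in the IP are not treated as wildcards.
        if ! printf '%s\n' "$flows" | grep -qF -- "$rule"; then
            return 0
        fi
    done
    return 1
}

# Possible use inside setup_required(), assuming $subnet_gateway holds the
# gateway IP:
#   if ovs-ofctl -O OpenFlow13 dump-flows br0 | tun_rules_missing "$subnet_gateway"; then
#       return 0
#   fi
```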

danwinship commented 8 years ago

Ah, yes.

> Origin-node fails on the first startup attempt and when restarted everything seems okay.

How does it fail exactly? What gets logged?

bendikp commented 8 years ago

This is the output from /var/log/messages when Origin Node tries to start after a reboot of the server:

Oct 22 12:19:11 openshift-node-08 systemd: Starting Origin Node...
Oct 22 12:19:11 openshift-node-08 origin-node: I1022 12:19:11.459000    2540 start_node.go:175] Starting a node connected to https://openshift-master-06:8443
Oct 22 12:19:11 openshift-node-08 origin-node: I1022 12:19:11.462598    2540 start_node.go:267] Starting node openshift-node-08 (v1.0.6-2-ge2a02a8)
Oct 22 12:19:11 openshift-node-08 docker: time="2015-10-22T12:19:11.465233232+02:00" level=info msg="GET /_ping"
Oct 22 12:19:11 openshift-node-08 origin-node: I1022 12:19:11.465703    2540 node.go:53] Connecting to Docker at unix:///var/run/docker.sock
Oct 22 12:19:11 openshift-node-08 docker: time="2015-10-22T12:19:11.465977830+02:00" level=info msg="GET /version"
Oct 22 12:19:11 openshift-node-08 origin-node: I1022 12:19:11.466623    2540 common.go:76] Self IP: xx.xx.xx.xx.
Oct 22 12:19:11 openshift-node-08 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl del-br br0
Oct 22 12:19:11 openshift-node-08 kernel: device br0 left promiscuous mode
Oct 22 12:19:11 openshift-node-08 kernel: device tun0 left promiscuous mode
Oct 22 12:19:11 openshift-node-08 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl add-br br0 -- set Bridge br0 fail-mode=secure
Oct 22 12:19:11 openshift-node-08 kernel: openvswitch: netlink: Key attribute has unexpected length (type=21, length=4, expected=0).
Oct 22 12:19:11 openshift-node-08 kernel: device br0 entered promiscuous mode
Oct 22 12:19:11 openshift-node-08 kernel: openvswitch: netlink: Flow get message rejected, Key attribute missing.
Oct 22 12:19:11 openshift-node-08 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set bridge br0 protocols=OpenFlow13
Oct 22 12:19:11 openshift-node-08 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl del-port br0 vxlan0
Oct 22 12:19:11 openshift-node-08 ovs-vsctl: ovs|00002|vsctl|ERR|no port named vxlan0
Oct 22 12:19:11 openshift-node-08 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl add-port br0 vxlan0 -- set Interface vxlan0 type=vxlan options:remote_ip=flow options:key=flow ofport_request=1
Oct 22 12:19:11 openshift-node-08 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl add-port br0 tun0 -- set Interface tun0 type=internal ofport_request=2
Oct 22 12:19:11 openshift-node-08 kernel: device tun0 entered promiscuous mode
Oct 22 12:19:11 openshift-node-08 kernel: IPv6: ADDRCONF(NETDEV_UP): vlinuxbr: link is not ready
Oct 22 12:19:11 openshift-node-08 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vlinuxbr: link becomes ready
Oct 22 12:19:11 openshift-node-08 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl del-port br0 vovsbr
Oct 22 12:19:11 openshift-node-08 ovs-vsctl: ovs|00002|vsctl|ERR|no port named vovsbr
Oct 22 12:19:11 openshift-node-08 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl add-port br0 vovsbr -- set Interface vovsbr ofport_request=9
Oct 22 12:19:11 openshift-node-08 kernel: device vovsbr entered promiscuous mode
Oct 22 12:19:11 openshift-node-08 kernel: device vlinuxbr entered promiscuous mode
Oct 22 12:19:11 openshift-node-08 kernel: lbr0: port 1(vlinuxbr) entered forwarding state
Oct 22 12:19:11 openshift-node-08 kernel: lbr0: port 1(vlinuxbr) entered forwarding state
Oct 22 12:19:11 openshift-node-08 systemd: Reloading.
Oct 22 12:19:11 openshift-node-08 systemd: [/usr/lib/systemd/system/lvm2-lvmetad.socket:9] Unknown lvalue 'RemoveOnStop' in section 'Socket'
Oct 22 12:19:11 openshift-node-08 systemd: [/usr/lib/systemd/system/dm-event.socket:10] Unknown lvalue 'RemoveOnStop' in section 'Socket'
Oct 22 12:19:11 openshift-node-08 systemd: [/usr/lib/systemd/system/openvswitch-nonetwork.service:13] Unknown lvalue 'RuntimeDirectory' in section 'Service'
Oct 22 12:19:11 openshift-node-08 systemd: [/usr/lib/systemd/system/openvswitch-nonetwork.service:14] Unknown lvalue 'RuntimeDirectoryMode' in section 'Service'
Oct 22 12:19:11 openshift-node-08 systemd: Stopping Docker Application Container Engine...
Oct 22 12:19:11 openshift-node-08 docker: time="2015-10-22T12:19:11.736745088+02:00" level=info msg="Processing signal 'terminated'"
Oct 22 12:19:11 openshift-node-08 origin-node: F1022 12:19:11.773004    2540 node.go:85] ERROR: Unable to check for Docker server version.
Oct 22 12:19:11 openshift-node-08 origin-node: unexpected EOF
Oct 22 12:19:11 openshift-node-08 systemd: Starting Docker Storage Setup...
Oct 22 12:19:11 openshift-node-08 systemd: origin-node.service: main process exited, code=exited, status=255/n/a
Oct 22 12:19:11 openshift-node-08 systemd: Failed to start Origin Node.

When Origin Node is started again, everything seems okay:

Oct 22 12:19:22 openshift-node-08 systemd: Starting Origin Node...
Oct 22 12:19:22 openshift-node-08 origin-node: I1022 12:19:22.871011    2769 start_node.go:175] Starting a node connected to https://openshift-master-06:8443
Oct 22 12:19:22 openshift-node-08 origin-node: I1022 12:19:22.873649    2769 start_node.go:267] Starting node openshift-node-08 (v1.0.6-2-ge2a02a8)
Oct 22 12:19:22 openshift-node-08 docker: time="2015-10-22T12:19:22.876052024+02:00" level=info msg="GET /_ping"
Oct 22 12:19:22 openshift-node-08 origin-node: I1022 12:19:22.876231    2769 node.go:53] Connecting to Docker at unix:///var/run/docker.sock
Oct 22 12:19:22 openshift-node-08 docker: time="2015-10-22T12:19:22.876359738+02:00" level=info msg="GET /version"
Oct 22 12:19:22 openshift-node-08 origin-node: I1022 12:19:22.878095    2769 common.go:76] Self IP: xx.xx.xx.xx.
Oct 22 12:19:22 openshift-node-08 origin-node: I1022 12:19:22.915042    2769 manager.go:127] cAdvisor running in container: "/"
Oct 22 12:19:22 openshift-node-08 origin-node: I1022 12:19:22.915482    2769 proxier.go:125] Setting proxy IP to xx.xx.xx.xx and initializing iptables
Oct 22 12:19:22 openshift-node-08 origin-node: I1022 12:19:22.915478    2769 fs.go:93] Filesystem partitions: map[/dev/mapper/rhel-root:{mountpoint:/ major:253 minor:0} /dev/sda1:{mountpoint:/boot major:8 minor:1}]
Oct 22 12:19:22 openshift-node-08 systemd: Started Origin Node.
Oct 22 12:19:22 openshift-node-08 origin-node: I1022 12:19:22.924860    2769 manager.go:158] Machine: {NumCores:8 CpuFrequency:2800000 MemoryCapacity:16656232448 MachineID:a05388241fe34f378ef310499558f6fe SystemUUID:4203A0C0-DFD0-C30A-4E28-56516497DFE6 BootID:0e183bd9-ffa4-4424-841d-2c76dd48c491 Filesystems:[{Device:/dev/mapper/rhel-root Capacity:204372480000} {Device:/dev/sda1 Capacity:1045082112}] DiskMap:map[8:0:{Name:sda Major:8 Minor:0 Size:214748364800 Scheduler:deadline} 253:0:{Name:dm-0 Major:253 Minor:0 Size:204472320000 Scheduler:none} 253:1:{Name:dm-1 Major:253 Minor:1 Size:9223274496 Scheduler:none} 253:2:{Name:dm-2 Major:253 Minor:2 Size:107374182400 Scheduler:none} 2:0:{Name:fd0 Major:2 Minor:0 Size:0 Scheduler:deadline}] NetworkDevices:[{Name:br0 MacAddress:4a:0b:69:ee:2d:44 Speed:0 Mtu:1500} {Name:eth0 MacAddress:00:50:56:83:5e:b7 Speed:10000 Mtu:1500} {Name:lbr0 MacAddress:02:64:f1:78:d5:ea Speed:0 Mtu:1500} {Name:ovs-system MacAddress:ee:52:69:68:39:07 Speed:0 Mtu:1500} {Name:tun0 MacAddress:06:c7:fa:1c:6b:c7 Speed:0 Mtu:1500} {Name:vlinuxbr MacAddress:02:64:f1:78:d5:ea Speed:10000 Mtu:1500} {Name:vovsbr MacAddress:7e:5f:ac:59:81:ea Speed:10000 Mtu:1500}] Topology:[{Id:0 Memory:17179336704 Cores:[{Id:0 Threads:[0] Caches:[]} {Id:1 Threads:[1] Caches:[]}] Caches:[{Size:26214400 Type:Unified Level:3}]} {Id:1 Memory:0 Cores:[{Id:0 Threads:[2] Caches:[]} {Id:1 Threads:[3] Caches:[]}] Caches:[{Size:26214400 Type:Unified Level:3}]} {Id:2 Memory:0 Cores:[{Id:0 Threads:[4] Caches:[]} {Id:1 Threads:[5] Caches:[]}] Caches:[{Size:26214400 Type:Unified Level:3}]} {Id:3 Memory:0 Cores:[{Id:0 Threads:[6] Caches:[]} {Id:1 Threads:[7] Caches:[]}] Caches:[{Size:26214400 Type:Unified Level:3}]}] CloudProvider:Unknown InstanceType:Unknown}
Oct 22 12:19:22 openshift-node-08 docker: time="2015-10-22T12:19:22.925506874+02:00" level=info msg="GET /version"
Oct 22 12:19:22 openshift-node-08 origin-node: I1022 12:19:22.939426    2769 node.go:196] Started Kubernetes Proxy on 0.0.0.0
Oct 22 12:19:22 openshift-node-08 origin-node: I1022 12:19:22.939873    2769 kube.go:29] Output of setup script:
Oct 22 12:19:22 openshift-node-08 origin-node: + lock_file=/var/lock/openshift-sdn.lock
Oct 22 12:19:22 openshift-node-08 origin-node: + subnet_gateway=10.254.2.1
Oct 22 12:19:22 openshift-node-08 origin-node: + subnet=10.254.2.0/24
Oct 22 12:19:22 openshift-node-08 origin-node: + cluster_subnet=10.254.0.0/16
Oct 22 12:19:22 openshift-node-08 origin-node: + subnet_mask_len=24
Oct 22 12:19:22 openshift-node-08 origin-node: + tun_gateway=10.254.2.1
Oct 22 12:19:22 openshift-node-08 origin-node: + mtu=1450
Oct 22 12:19:22 openshift-node-08 origin-node: + printf 'Container network is "%s"; local host has subnet "%s", mtu "%d" and gateway "%s".\n' 10.254.0.0/16 10.254.2.0/24 1450 10.254.2.1
Oct 22 12:19:22 openshift-node-08 origin-node: Container network is "10.254.0.0/16"; local host has subnet "10.254.2.0/24", mtu "1450" and gateway "10.254.2.1".
Oct 22 12:19:22 openshift-node-08 origin-node: + TUN=tun0
Oct 22 12:19:22 openshift-node-08 origin-node: + set +e
Oct 22 12:19:22 openshift-node-08 origin-node: + setup_required
Oct 22 12:19:22 openshift-node-08 origin-node: +++ ip a s lbr0
Oct 22 12:19:22 openshift-node-08 origin-node: +++ awk '/inet / {print $2}'
Oct 22 12:19:22 openshift-node-08 origin-node: ++ echo 10.254.2.1/24
Oct 22 12:19:22 openshift-node-08 origin-node: + ip=10.254.2.1/24
Oct 22 12:19:22 openshift-node-08 origin-node: + '[' 10.254.2.1/24 '!=' 10.254.2.1/24 ']'
Oct 22 12:19:22 openshift-node-08 origin-node: + grep -q lbr0 /run/openshift-sdn/docker-network
Oct 22 12:19:22 openshift-node-08 origin-node: + return 1
Oct 22 12:19:22 openshift-node-08 origin-node: + echo 'SDN setup not required.'
Oct 22 12:19:22 openshift-node-08 origin-node: SDN setup not required.
Oct 22 12:19:22 openshift-node-08 origin-node: + exit 140

Except the OVS-rules for output:2 were missing when I ran ovs-ofctl -O OpenFlow13 dump-flows br0.
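For anyone trying to pin down exactly which rules are gone, comparing a dump saved while the node was healthy against one taken after the failed start may help. The sketch below is a rough helper for that, under the assumption that both dumps were saved to files. `normalize_flows` and `missing_flows` are hypothetical names; the only fields stripped are the ones that visibly change between the dumps in this thread (`duration`, `n_packets`, `n_bytes`).

```shell
# Hypothetical helper: reduce a dump-flows listing to comparable rule text by
# dropping the reply header, leading whitespace, and per-flow counters.
normalize_flows() {
    grep -v '^OFPST_FLOW' |
        sed -E -e 's/^[[:space:]]+//' \
               -e 's/(duration|n_packets|n_bytes)=[^,]*,[[:space:]]*//g' |
        sort
}

# Print rules present in the first dump file but absent from the second.
# Usage: missing_flows HEALTHY_DUMP_FILE CURRENT_DUMP_FILE
missing_flows() {
    local tmp
    tmp=$(mktemp) || return 1
    normalize_flows <"$2" >"$tmp"
    # -v -x -F: keep whole lines that match none of the fixed strings in $tmp.
    normalize_flows <"$1" | grep -vxF -f "$tmp"
    rm -f "$tmp"
}

# Example (file names are placeholders):
#   ovs-ofctl -O OpenFlow13 dump-flows br0 > /tmp/flows.now
#   missing_flows /tmp/flows.healthy /tmp/flows.now
```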

eparis commented 8 years ago

We believe this is currently working. Please let us know if you still have any problems.