openvswitch / ovs-issues

Issue tracker repo for Open vSwitch

[windows] Port creation may fail with error "could not add network device xxx to ofproto (Invalid argument)" in containerd environment #343

Open twofish197 opened 5 days ago

twofish197 commented 5 days ago

On the Windows platform, we deployed a large cluster with 3 Ubuntu control-plane (CP) nodes and 100+ Windows worker nodes.

On each Windows node, a containerd pod is created on the host to create the vNIC, and the port type is set to internal to support connections between OVS and the pods. During the test, port creation failed on some Windows nodes (2%-3%). Below is the output of the command "ovs-vsctl show".

    Bridge br-int
        datapath_type: system
        Port antrea-gw0
            Interface antrea-gw0
                type: internal
        Port antrea-tun0
            Interface antrea-tun0
                type: geneve
                options: {key=flow, local_ip="10.244.3.24", remote_ip=flow}
        Port eth0
            Interface eth0
        Port br-int
            Interface br-int
                type: internal
        Port vsphere--c546b0
            Interface vsphere--c546b0
                type: internal
                error: "could not add network device vsphere--c546b0 to ofproto (Invalid argument)"
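With 100+ worker nodes, spotting the affected ports by eye is tedious. The following is a minimal sketch, assuming the text of "ovs-vsctl show" has already been captured into a string, that lists every interface carrying an error line. The function name find_failed_interfaces and the trimmed SAMPLE text are illustrative only and are not part of OVS or Antrea.

```python
def find_failed_interfaces(show_output: str) -> dict[str, str]:
    """Map interface name -> error text for each Interface block with an error line."""
    failed: dict[str, str] = {}
    current = None  # name of the Interface block we are currently inside
    for raw in show_output.splitlines():
        line = raw.strip()
        if line.startswith("Interface "):
            current = line[len("Interface "):].strip().strip('"')
        elif line.startswith(("Port ", "Bridge ")):
            # A new Port or Bridge block ends the previous Interface block.
            current = None
        elif line.startswith("error:") and current is not None:
            failed[current] = line[len("error:"):].strip().strip('"')
    return failed


# Trimmed copy of the output reported in this issue (illustrative input).
SAMPLE = """\
Bridge br-int
    datapath_type: system
    Port antrea-gw0
        Interface antrea-gw0
            type: internal
    Port vsphere--c546b0
        Interface vsphere--c546b0
            type: internal
            error: "could not add network device vsphere--c546b0 to ofproto (Invalid argument)"
"""

if __name__ == "__main__":
    for name, err in find_failed_interfaces(SAMPLE).items():
        print(f"{name}: {err}")
```

Running this against the output from each node would give a quick list of ports that need the recovery step.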

Any ovsdb-server configuration change recovers the port, and restarting ovs-vswitchd also fixes the issue. Below are the commands used (just one example).

ovs-vsctl.exe --no-wait add-port br-int podvif38 -- set interface podvif38 
ovs-vsctl.exe --no-wait del-port br-int podvif38
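The workaround above could be scripted roughly as follows. This is a sketch only: it assumes ovs-vsctl.exe is on PATH, and it uses just the add-port and del-port subcommands with --no-wait as shown above. The arguments of the "set interface" part are cut off in the report, so the sketch leaves that part out. The helper names nudge_commands and nudge are hypothetical.

```python
import subprocess


def nudge_commands(bridge: str, dummy_port: str,
                   ovs_vsctl: str = "ovs-vsctl.exe") -> list[list[str]]:
    """Build the add-port/del-port pair that causes an ovsdb-server config change."""
    return [
        [ovs_vsctl, "--no-wait", "add-port", bridge, dummy_port],
        [ovs_vsctl, "--no-wait", "del-port", bridge, dummy_port],
    ]


def nudge(bridge: str, dummy_port: str, dry_run: bool = True) -> list[list[str]]:
    """Print the commands (dry run) or execute them in order."""
    commands = nudge_commands(bridge, dummy_port)
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True raises CalledProcessError if ovs-vsctl exits non-zero.
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    # Dry run by default so nothing is changed on the node.
    nudge("br-int", "podvif38")
```

Keeping dry_run as the default makes it safe to inspect the commands first; pass dry_run=False on a node where the change is acceptable.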

Below is the complete ovs-vswitchd.log from a failed node.

twofish197 commented 5 days ago

Attaching the log from the failed node as the file ovs-vswitchd_failed_port_allocating.log

twofish197 commented 5 days ago

ovs-vswitchd_failed_port_allocating.log

twofish197 commented 5 days ago

After debugging some of the failed Windows VMs, this appears to be a known issue that was addressed by the commit below.

So it is likely that OVS on Windows blocks some port allocations to avoid an unrecoverable case.

netdev-windows: Add checking when creating netdev with system type on Windows https://github.com/openvswitch/ovs/commit/1cdc0529f742a03bc6ed615de897eb68cf140ac1

Quoting the bug description here: Some system type port will be created netdev successfully and it will cause conflict as in the dpif side it will be internal type. So finally the port will be created failed and it could not be easily recovered.

With the patch, on Windows the netdev creating will be blocked for system type when the ovs_type got on dpif is internal. More detailed case description is in the reported issue No.262 with link below. https://github.com/openvswitch/ovs-issues/issues/262
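To make the quoted condition concrete, here is a small Python model of the check as the description words it. It is not the C code from the commit, and the function and parameter names are hypothetical. It only encodes the stated rule that netdev creation is blocked when the requested type is system while the type reported by dpif is internal.

```python
def should_block_netdev_create(netdev_type: str, dpif_ovs_type: str) -> bool:
    """Model of the described rule: block a system-type netdev when dpif reports internal."""
    return netdev_type == "system" and dpif_ovs_type == "internal"


if __name__ == "__main__":
    # The conflicting case from the description: blocked.
    print(should_block_netdev_create("system", "internal"))
    # Matching types: not blocked by this rule.
    print(should_block_netdev_create("internal", "internal"))
```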

With the current OVS Windows logic, a failed port add needs an extra configuration change on ovsdb-server before it recovers. We could check whether logic can be added in OVS userspace to resync when OVS on Windows blocks a port add. That work will be tracked by this upstream issue.