openvswitch / ovs-issues

Issue tracker repo for Open vSwitch

OVS-DPDK: ovs-vswitchd deadlock when using bond_mode=balance-tcp #264

Closed BigCousin-z closed 1 year ago

BigCousin-z commented 1 year ago

DPDK: 21.11, OVS: 2.17.2

When the bond mode in OVS-DPDK is balance-tcp and the device is restarted, the deadlock is triggered intermittently, but the log contains no useful information.

log info: ovs-vswitchd:ofproto/ofproto.c:6282: pthread_mutex_lock failed (Resource deadlock avoided)

2022-09-16T06:58:33.448Z|00167|bridge|INFO|bridge br_dpdk: added interface enp98s0 on port 1
2022-09-16T06:58:33.450Z|00168|dpdk|INFO|Device with port_id=2 already stopped
2022-09-16T06:58:33.575Z|00169|dpdk|ERR|i40e_pf_get_vsi_by_qindex(): queue_idx out of range. VMDQ configured?
2022-09-16T06:58:33.575Z|00170|netdev_dpdk|INFO|Interface enp97s0 unable to setup txq(64): Invalid argument
2022-09-16T06:58:33.575Z|00171|netdev_dpdk|INFO|Retrying setup with (rxq:8 txq:64)
2022-09-16T06:58:33.782Z|00172|netdev_dpdk|INFO|Port 2: 40:a6:b7:20:6a:80
2022-09-16T06:58:33.783Z|00173|dpif_netdev|INFO|Performing pmd to rx queue assignment using cycles algorithm.
2022-09-16T06:58:33.783Z|00174|dpif_netdev|INFO|Core 32 on numa node 4 assigned port 'enp98s0' rx queue 0

2022-09-16T06:58:33.795Z|00208|bridge|INFO|bridge br_dpdk: added interface br_dpdk on port 65534
2022-09-16T06:58:33.796Z|00209|bridge|INFO|bridge br-ex: added interface patch-provnet-efcdafee-c427-4a54-9467-f77ea661a171-to-br-int on port 3
2022-09-16T06:58:33.797Z|00210|bridge|INFO|bridge br-ex: added interface patch-provnet-f8485b7d-4114-4f64-b4cd-10781ec0c649-to-br-int on port 4
2022-09-16T06:58:33.797Z|00211|dpdk|INFO|Device with port_id=0 already stopped
2022-09-16T06:58:33.892Z|00001|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2022-09-16T06:58:33.969Z|00212|dpdk|ERR|i40e_pf_get_vsi_by_qindex(): queue_idx out of range. VMDQ configured?
2022-09-16T06:58:33.969Z|00213|netdev_dpdk|INFO|Interface enp33s0 unable to setup txq(64): Invalid argument
2022-09-16T06:58:33.969Z|00214|netdev_dpdk|INFO|Retrying setup with (rxq:8 txq:64)
2022-09-16T06:58:34.172Z|00215|netdev_dpdk|INFO|Port 0: 40:a6:b7:20:6a:48
2022-09-16T06:58:34.173Z|00216|dpif_netdev|INFO|Performing pmd to rx queue assignment using cycles algorithm.
2022-09-16T06:58:34.173Z|00217|dpif_netdev|INFO|Core 32 on numa node 4 assigned port 'enp98s0' rx queue 0 (measured processing cycles 0).
2022-09-16T06:58:34.173Z|00218|dpif_netdev|INFO|Core 33 on numa node 4 assigned port

2022-09-16T06:58:34.578Z|00514|bond|INFO|member enp98s0: enabled
2022-09-16T06:58:34.578Z|00515|bond|INFO|member enp97s0: enabled
2022-09-16T06:58:34.578Z|00516|connmgr|INFO|br_dpdk: added service controller "punix:/var/run/openvswitch/br_dpdk.mgmt"
2022-09-16T06:58:34.579Z|00517|bridge|INFO|bridge br-ex: using datapath ID 000040a6b7206a48
2022-09-16T06:58:34.579Z|00518|bond|INFO|member enp33s0: enabled
2022-09-16T06:58:34.579Z|00519|bond|INFO|member enp34s0: enabled
2022-09-16T06:58:34.579Z|00520|connmgr|INFO|br-ex: added service controller "punix:/var/run/openvswitch/br-ex.mgmt"
2022-09-16T06:58:34.579Z|00521|bridge|INFO|bridge br-int: using datapath ID 000036a99830714f
2022-09-16T06:58:34.579Z|00522|bfd|INFO|ovn-1adb6b-0: BFD state change: admin_down->down "No Diagnostic"->"No Diagnostic". Forwarding: false

2022-09-16T06:58:34.662Z|00601|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.17.2
2022-09-16T06:58:34.668Z|00602|netdev_linux|INFO|tunl0 device has unknown hardware address family 768
2022-09-16T06:58:36.677Z|00603|bond|INFO|member enp97s0: link state up
2022-09-16T06:58:36.677Z|00604|bond|INFO|member enp97s0: enabled
2022-09-16T06:58:36.677Z|00605|bond|INFO|bond bond2: active member is now enp97s0
2022-09-16T06:58:36.677Z|00606|bond|INFO|member enp33s0: link state up
2022-09-16T06:58:36.677Z|00607|bond|INFO|member enp33s0: enabled
2022-09-16T06:58:36.677Z|00608|bond|INFO|member enp34s0: link state up
2022-09-16T06:58:36.677Z|00609|bond|INFO|member enp34s0: enabled
2022-09-16T06:58:36.677Z|00610|bond|INFO|bond bond1: active member is now enp33s0
2022-09-16T06:58:37.187Z|00611|bond|INFO|member enp98s0: link state up
2022-09-16T06:58:37.187Z|00612|bond|INFO|member enp98s0: enabled
2022-09-16T06:58:40.194Z|00613|memory|INFO|1245532 kB peak resident set size after 10.1 seconds
2022-09-16T06:58:40.195Z|00614|memory|INFO|handlers:1 idl-cells:8023 ports:140 revalidators:1 rules:270 udpif keys:62
2022-09-16T06:58:41.195Z|00002|bfd(pmd-c26/id:61)|INFO|Interface ovn-f56f52-0 remote mult value 0 changed to 3
2022-09-16T06:58:41.195Z|00003|bfd(pmd-c26/id:61)|INFO|Dropped 36 log messages in last 7 seconds (most recently, 7 seconds ago) due to excessive rate
2022-09-16T06:58:41.195Z|00004|bfd(pmd-c26/id:61)|INFO|ovn-f56f52-0: New remote min_rx. vers:1 diag:"Control Detection Time Expired" state:down mult:3 length:24 flags: none

BigCousin-z commented 1 year ago

2022-09-16T06:58:30.131Z|00015|dpdk|INFO|Using DPDK 21.11.0
2022-09-16T06:58:30.131Z|00016|dpdk|INFO|DPDK Enabled - initializing...
2022-09-16T06:58:30.131Z|00017|dpdk|INFO|No vhost-sock-dir provided - defaulting to /var/run/openvswitch
2022-09-16T06:58:30.131Z|00018|dpdk|INFO|IOMMU support for vhost-user-client disabled.
2022-09-16T06:58:30.131Z|00019|dpdk|INFO|POSTCOPY support for vhost-user-client disabled.
2022-09-16T06:58:30.131Z|00020|dpdk|INFO|Per port memory for DPDK devices disabled.
2022-09-16T06:58:30.131Z|00021|dpdk|INFO|EAL ARGS: ovs-vswitchd -c 0x3 --huge-dir /dev/hugepages --socket-mem 2048,2048 --in-memory.
2022-09-16T06:58:30.137Z|00022|dpdk|INFO|EAL: Detected CPU lcores: 128
2022-09-16T06:58:30.137Z|00023|dpdk|INFO|EAL: Detected NUMA nodes: 8
2022-09-16T06:58:30.137Z|00024|dpdk|INFO|EAL: Detected shared linkage of DPDK
2022-09-16T06:58:30.170Z|00025|dpdk|INFO|EAL: Selected IOVA mode 'VA'
2022-09-16T06:58:30.170Z|00026|dpdk|WARN|EAL: No available 2048 kB hugepages reported
2022-09-16T06:58:30.171Z|00027|dpdk|WARN|EAL: No free 2048 kB hugepages reported on node 0
2022-09-16T06:58:30.171Z|00028|dpdk|WARN|EAL: No free 2048 kB hugepages reported on node 1
2022-09-16T06:58:30.171Z|00029|dpdk|WARN|EAL: No free 2048 kB hugepages reported on node 2
2022-09-16T06:58:30.171Z|00030|dpdk|WARN|EAL: No free 2048 kB hugepages reported on node 3
2022-09-16T06:58:30.171Z|00031|dpdk|WARN|EAL: No free 2048 kB hugepages reported on node 4
2022-09-16T06:58:30.171Z|00032|dpdk|WARN|EAL: No free 2048 kB hugepages reported on node 5
2022-09-16T06:58:30.171Z|00033|dpdk|WARN|EAL: No free 2048 kB hugepages reported on node 6
2022-09-16T06:58:30.171Z|00034|dpdk|WARN|EAL: No free 2048 kB hugepages reported on node 7
2022-09-16T06:58:30.171Z|00035|dpdk|WARN|EAL: No available 2048 kB hugepages reported
2022-09-16T06:58:30.172Z|00036|dpdk|INFO|EAL: VFIO support initialized
2022-09-16T06:58:31.074Z|00037|dpdk|INFO|EAL: Using IOMMU type 1 (Type 1)

igsilya commented 1 year ago

Hi. The symptoms look very similar to a double-lock issue in #259. Could you try the latest branch-2.17?

BigCousin-z commented 1 year ago

Hi. The symptoms look very similar to a double-lock issue in #259 . Could you try the latest branch-2.17 ?

Yes, I have seen the patch, but why is the patch not merged into master or the other branches of OVS?

igsilya commented 1 year ago

Why is the patch not merged into the master or other branches of the ovs ?

It was merged a week ago right before I replied in this thread. See https://github.com/openvswitch/ovs/commit/586adfd047cbbbf5fae329180adbce1f4f4eb1db .

BigCousin-z commented 1 year ago

Why is the patch not merged into the master or other branches of the ovs ?

It was merged a week ago right before I replied in this thread. See openvswitch/ovs@586adfd .

Yes, I have tested it.

Why is the patch not merged into the master or other branches of the ovs ?

It was merged a week ago right before I replied in this thread. See openvswitch/ovs@586adfd .

Yes, the patch solves my problem, but one question remains: why does balance-slb not cause the deadlock?

igsilya commented 1 year ago

It was merged a week ago right before I replied in this thread. See openvswitch/ovs@586adfd .

Yes the path solves my problem, but one problem is why balance-slb does not cause deadlock?

Balance-tcp uses dp_hash + recirculation, and dp_hash is generally a 5-tuple hash. Balance-slb uses the source MAC address + VLAN to determine the bond member. There is no recirculation involved and hence no need for any hidden OpenFlow rules. Since there is no need to change OpenFlow rules, OVS will not take the ofproto lock.
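
To illustrate the difference in hash inputs described above, here is a rough sketch. It is not OVS code: struct flow_key, slb_style_hash, tcp_style_hash, and the FNV-1a hash are all hypothetical stand-ins, and the real dp_hash is computed by the datapath. The only point is that two flows from the same source MAC and VLAN map to the same value in the balance-slb style, while a 5-tuple hash can tell them apart.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow key for illustration; not an OVS structure. */
struct flow_key {
    uint8_t  src_mac[6];
    uint16_t vlan;
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  ip_proto;
};

#define FNV_BASIS 2166136261u
#define FNV_PRIME 16777619u

/* FNV-1a over a byte range, used here only as a stand-in hash. */
static uint32_t fnv1a(uint32_t h, const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        h = (h ^ data[i]) * FNV_PRIME;
    }
    return h;
}

/* Feeds a 32-bit value into the hash in a fixed byte order. */
static uint32_t fnv1a_u32(uint32_t h, uint32_t v)
{
    uint8_t b[4] = { (uint8_t)(v >> 24), (uint8_t)(v >> 16),
                     (uint8_t)(v >> 8), (uint8_t)v };
    return fnv1a(h, b, sizeof b);
}

/* balance-slb style: only source MAC and VLAN contribute. */
uint32_t slb_style_hash(const struct flow_key *k)
{
    uint32_t h = fnv1a(FNV_BASIS, k->src_mac, sizeof k->src_mac);
    return fnv1a_u32(h, k->vlan);
}

/* balance-tcp style: a 5-tuple (addresses, ports, protocol) contributes. */
uint32_t tcp_style_hash(const struct flow_key *k)
{
    uint32_t h = fnv1a_u32(FNV_BASIS, k->src_ip);
    h = fnv1a_u32(h, k->dst_ip);
    h = fnv1a_u32(h, ((uint32_t)k->src_port << 16) | k->dst_port);
    return fnv1a_u32(h, k->ip_proto);
}
```

For example, two TCP connections from the same host that differ only in source port get the same slb_style_hash value but different tcp_style_hash values, so only the latter can spread them across bond members.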