Closed SElsharkawy closed 6 years ago
Thank you for the report.
Can you get a backtrace for the ovs-vswitchd threads?
Here is the ovs-vswitchd thread information after getting into the blocked state: (Blocked thread is handler11)
1 Thread 0x7f28c8444940 (LWP 1736) "ovs-vswitchd" 0x00007f28c7952730 in futex_wait (private=
thread 1 backtrace
thread 2 backtrace
thread 3 backtrace
thread 4 backtrace
Thanks for the backtrace. It helps a little, since clearly the mcast_snooping rwlock is involved. But it's otherwise a pretty poor backtrace because there are hardly any function names or line numbers. Did you do something like stripping debugging symbols?
I suspect what's happening here is that output is recursively trying to take the wrlock but it's really hard to trace it without a good backtrace and I'd like to be able to confirm that.
Probably the easiest solution (if this is correct) is to convert the rwlock to RCU.
I think gdb did not get the symbols file. I will try to generate better backtrack logs.
On Mon, Jul 9, 2018, 7:56 PM Ben Pfaff notifications@github.com wrote:
Thanks for the backtrace. It helps a little, since clearly the mcast_snooping rwlock is involved. But it's otherwise a pretty poor backtrace because there are hardly any function names or line numbers. Did you do something like stripping debugging symbols?
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/openvswitch/ovs-issues/issues/153#issuecomment-403655758, or mute the thread https://github.com/notifications/unsubscribe-auth/AXLdXNKxrdW13PcDuCKjsRP4F94NHk8Iks5uE-20gaJpZM4VF6tg .
Here is a longer thread backtrace. I hope this is more helpful.
Open vSwitch version: 2.9.2 Linux Kernel: 4.13.0-45-generic
I posted a fix for testing: https://patchwork.ozlabs.org/patch/960078/
The fix has been applied to master and backported as far as branch-2.8.
Open vSwitch (Version 2.9.2) goes into deadlocked state when receiving an ICMPv6 multicast listener report v2 packet on a system port and trying to send it back on the IN_PORT after modifying the VLAN tag. The message is generated as a result of connecting a new device to a physical VLAN access port on a HW switch trunked to the system port on a VTEP virtual bridge which is connected to VXLAN overlay bridge. See attached network diagram for connections' details.
ovs-vswitchd.log:
File: ovs-vswitchd-tail.log
The ovs-vswitchd service is locked waiting for handler thread to quiesce. Here is the log tail:
2018-07-06T17:59:05.086Z|00092|connmgr|INFO|overlay-br<->unix#77: 1 flow_mods in the last 0 s (1 adds) 2018-07-06T17:59:05.109Z|00093|connmgr|INFO|overlay-br<->unix#79: 1 flow_mods in the last 0 s (1 adds) 2018-07-06T17:59:05.126Z|00094|connmgr|INFO|overlay-br<->unix#81: 1 flow_mods in the last 0 s (1 adds) 2018-07-06T17:59:09.940Z|00095|memory|INFO|peak resident set size grew 270% in last 10.0 seconds, from 9564 kB to 35356 kB 2018-07-06T17:59:09.940Z|00096|memory|INFO|handlers:1 ports:18 revalidators:1 rules:64 udpif keys:28 2018-07-06T18:02:15.241Z|00097|ovs_rcu|WARN|blocked 1001 ms waiting for handler11 to quiesce 2018-07-06T18:02:16.241Z|00098|ovs_rcu|WARN|blocked 2001 ms waiting for handler11 to quiesce 2018-07-06T18:02:18.241Z|00099|ovs_rcu|WARN|blocked 4001 ms waiting for handler11 to quiesce 2018-07-06T18:02:21.769Z|00001|ovs_rcu(urcu3)|WARN|blocked 1001 ms waiting for handler11 to quiesce 2018-07-06T18:02:22.241Z|00100|ovs_rcu|WARN|blocked 8001 ms waiting for handler11 to quiesce 2018-07-06T18:02:22.769Z|00002|ovs_rcu(urcu3)|WARN|blocked 2001 ms waiting for handler11 to quiesce 2018-07-06T18:02:24.768Z|00003|ovs_rcu(urcu3)|WARN|blocked 4000 ms waiting for handler11 to quiesce 2018-07-06T18:02:28.769Z|00004|ovs_rcu(urcu3)|WARN|blocked 8001 ms waiting for handler11 to quiesce
Packet causing problem (anonymized):
14:02:13.930784 f0:XX:XX:XX:16:75 > 33:33:00:00:00:16, ethertype 802.1Q (0x8100), length 94: vlan 2, p 0, ethertype IPv6, (hlim 1, next-header Options (0) payload length: 36) fe80::f2XX:XXff:feXX:1675 > ff02::16: HBH (rtalert: 0x0000) (padn) [icmp6 sum ok] ICMP6, multicast listener report v2, 1 group record(s) [gaddr ff02::1:ff00:0 to_ex { }] 0x0000: 6000 0000 0024 0001 fe80 0000 0000 0000 `....$.......... 0x0010: f2FF XXff feXX 1675 ff02 0000 0000 0000 .......u........ 0x0020: 0000 0000 0000 0016 3a00 0502 0000 0100 ........:....... 0x0030: 8f00 a614 0000 0001 0400 0000 ff02 0000 ................ 0x0040: 0000 0000 0000 0001 ff00 0000 ............
Environment:
Open vSwitch Version: 2.9.2 Linux Kernel: Linux ovs-node-3 4.11.12-041112-generic #201707210350 SMP Fri Jul 21 07:53:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
OVS Commands After Failure:
ovs-dpct show: system@ovs-system: lookups: hit:14046 missed:724 lost:0 flows: 7 masks: hit:18744 total:3 hit/pkt:1.27 port 0: ovs-system (internal) port 1: vtep-br (internal) port 2: eth2 port 3: overlay-br (internal) port 4: vxlan_sys_4789 (vxlan: packet_type=ptap) port 5: ls33 (internal) port 6: ls44 (internal) port 7: c372547724674_l port 8: 98a406d0fe844_l port 9: 4087cc0400cc4_l port 10: 23935f969a5d4_l
ovs-ofctl show overlay-br Command blocks and does not generate any output
ovs-ofctl dump-flows -O OpenFlow13 overlay-br Command blocks and does not generate any output