Closed: jharbott closed this issue 4 years ago.
I also had a deadlock, but sadly I don't know how to debug it. Everything went back to normal after a restart.
openvswitch-common 2.11.0-0ubuntu2~cloud0
openvswitch-switch 2.11.0-0ubuntu2~cloud0
Linux cj204 5.0.0-29-generic #31~18.04.1-Ubuntu SMP Thu Sep 12 18:29:21 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
2019-11-12T12:51:03.233Z|00446|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2019-11-12T17:35:24.902Z|00001|ovs_rcu(urcu4)|WARN|blocked 1000 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:25.902Z|00002|ovs_rcu(urcu4)|WARN|blocked 2000 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:27.168Z|00447|ovs_rcu|WARN|blocked 1000 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:27.902Z|00003|ovs_rcu(urcu4)|WARN|blocked 4000 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:28.167Z|00448|ovs_rcu|WARN|blocked 2000 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:30.167Z|00449|ovs_rcu|WARN|blocked 4000 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:31.901Z|00004|ovs_rcu(urcu4)|WARN|blocked 8000 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:34.167Z|00450|ovs_rcu|WARN|blocked 8000 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:39.901Z|00005|ovs_rcu(urcu4)|WARN|blocked 16000 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:42.168Z|00451|ovs_rcu|WARN|blocked 16001 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:55.902Z|00006|ovs_rcu(urcu4)|WARN|blocked 32001 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:58.168Z|00452|ovs_rcu|WARN|blocked 32001 ms waiting for revalidator95 to quiesce
2019-11-12T17:36:27.902Z|00007|ovs_rcu(urcu4)|WARN|blocked 64000 ms waiting for revalidator95 to quiesce
2019-11-12T17:36:30.167Z|00453|ovs_rcu|WARN|blocked 64000 ms waiting for revalidator95 to quiesce
2019-11-12T17:37:31.902Z|00008|ovs_rcu(urcu4)|WARN|blocked 128000 ms waiting for revalidator95 to quiesce
2019-11-12T17:37:34.168Z|00454|ovs_rcu|WARN|blocked 128000 ms waiting for revalidator95 to quiesce
2019-11-12T17:39:39.901Z|00009|ovs_rcu(urcu4)|WARN|blocked 256000 ms waiting for revalidator95 to quiesce
2019-11-12T17:39:42.168Z|00455|ovs_rcu|WARN|blocked 256001 ms waiting for revalidator95 to quiesce
2019-11-12T17:43:55.904Z|00010|ovs_rcu(urcu4)|WARN|blocked 512003 ms waiting for revalidator95 to quiesce
2019-11-12T17:43:58.172Z|00456|ovs_rcu|WARN|blocked 512004 ms waiting for revalidator95 to quiesce
2019-11-12T17:52:27.902Z|00011|ovs_rcu(urcu4)|WARN|blocked 1024000 ms waiting for revalidator95 to quiesce
2019-11-12T17:52:30.168Z|00457|ovs_rcu|WARN|blocked 1024000 ms waiting for revalidator95 to quiesce
2019-11-12T18:09:31.908Z|00012|ovs_rcu(urcu4)|WARN|blocked 2048007 ms waiting for revalidator95 to quiesce
2019-11-12T18:09:34.177Z|00458|ovs_rcu|WARN|blocked 2048009 ms waiting for revalidator95 to quiesce
2019-11-12T18:43:39.942Z|00013|ovs_rcu(urcu4)|WARN|blocked 4096040 ms waiting for revalidator95 to quiesce
2019-11-12T18:43:42.210Z|00459|ovs_rcu|WARN|blocked 4096042 ms waiting for revalidator95 to quiesce
2019-11-12T19:51:56.000Z|00014|ovs_rcu(urcu4)|WARN|blocked 8192099 ms waiting for revalidator95 to quiesce
2019-11-12T19:51:58.268Z|00460|ovs_rcu|WARN|blocked 8192100 ms waiting for revalidator95 to quiesce
2019-11-12T22:08:28.002Z|00015|ovs_rcu(urcu4)|WARN|blocked 16384100 ms waiting for revalidator95 to quiesce
2019-11-12T22:08:30.267Z|00461|ovs_rcu|WARN|blocked 16384100 ms waiting for revalidator95 to quiesce
2019-11-13T02:41:32.001Z|00016|ovs_rcu(urcu4)|WARN|blocked 32768099 ms waiting for revalidator95 to quiesce
2019-11-13T02:41:34.230Z|00462|ovs_rcu|WARN|blocked 32768062 ms waiting for revalidator95 to quiesce
I can confirm this bug; the manifestation is exactly the same. We can reproduce it easily, as it occurs several times a day. It is also an OpenStack environment after the update to Bionic; we are at 2.12.0-0ubuntu1 (compiled from the latest Ubuntu dsc) and kernel 5.0.0-31-generic. Nothing has helped so far. Stuck thread:
40 Thread 0x7fb88f7fe700 (LWP 4457) "revalidator519" 0x00007fb91f544bf9 in __GI___poll (fds=fds@entry=0x7fb8cc029150, nfds=nfds@entry=2, timeout=timeout@entry=2147483647) at ../sysdeps/unix/sysv/linux/poll.c:29
41 Thread 0x7fb88effd700 (LWP 4458) "revalidator520" 0x00007fb92007d37b in futex_abstimed_wait (private=<optimized out>, abstime=0x0, expected=10, futex_word=0x5614a8a33378) at ../sysdeps/unix/sysv/linux/futex-internal.h:172
42 Thread 0x7fb88e7fc700 (LWP 4459) "revalidator521" 0x00007fb91f544bf9 in __GI___poll (fds=fds@entry=0x7fb8b412f6e0, nfds=nfds@entry=2, timeout=timeout@entry=2147483647) at ../sysdeps/unix/sysv/linux/poll.c:29
Backtrace of the stuck thread:
Thread 41 (Thread 0x7fb88effd700 (LWP 4458)):
__ret = -512
err = <optimized out>
err = <optimized out>
__ret = <optimized out>
resultvar = <optimized out>
__arg6 = <optimized out>
__arg5 = <optimized out>
__arg4 = <optimized out>
__arg3 = <optimized out>
__arg2 = <optimized out>
__arg1 = <optimized out>
_a6 = <optimized out>
_a5 = <optimized out>
_a4 = <optimized out>
_a3 = <optimized out>
_a2 = <optimized out>
_a1 = <optimized out>
private = <optimized out>
err = <optimized out>
r = 10
wpf = <optimized out>
ready = <optimized out>
r = <optimized out>
wpf = <optimized out>
ready = <optimized out>
__value = <optimized out>
private = <optimized out>
err = <optimized out>
private = <optimized out>
private = <optimized out>
err = <optimized out>
result = <optimized out>
l = 0x5614a8a33378
error = <optimized out>
__func__ = "ovs_rwlock_rdlock_at"
in_xbundle = <optimized out>
mac = <optimized out>
xvlan = {v = {{tpid = 33024, vid = 290, pcp = 0}, {tpid = 0, vid = 0, pcp = 0}}}
wc = 0x7fb88eff8750
flow = 0x7fb88eff8100
in_port = 0x5614a8ddb840
mac_port = <optimized out>
in_xvlan = {v = {{tpid = 33024, vid = 290, pcp = 0}, {tpid = 0, vid = 0, pcp = 0}}}
vlan = 290
is_grat_arp = <optimized out>
wc = <optimized out>
flow = <optimized out>
in_xbundle = <optimized out>
in_port = <optimized out>
mac = <optimized out>
mac_port = <optimized out>
in_xvlan = <optimized out>
xvlan = <optimized out>
vlan = <optimized out>
is_grat_arp = <optimized out>
__v = <optimized out>
__x = <optimized out>
__v = <optimized out>
__x = <optimized out>
entry = <optimized out>
ms = <optimized out>
grp = <optimized out>
out = <optimized out>
out = <optimized out>
out = <optimized out>
__v = <optimized out>
__x = <optimized out>
__v = <optimized out>
__x = <optimized out>
mac_xbundle = <optimized out>
prev_nf_output_iface = 65533
controller = <optimized out>
last = <optimized out>
mf = <optimized out>
set_field = <optimized out>
metadata = <optimized out>
__v = <optimized out>
__v = <optimized out>
__v = <optimized out>
__v = <optimized out>
__v = <optimized out>
__v = <optimized out>
__v = <optimized out>
__v = <optimized out>
__v = <optimized out>
__v = <optimized out>
__v = <optimized out>
__v = <optimized out>
__x = <optimized out>
__x = <optimized out>
__x = <optimized out>
__x = <optimized out>
__x = <optimized out>
__x = <optimized out>
__x = <optimized out>
__x = <optimized out>
__x = <optimized out>
__x = <optimized out>
__x = <optimized out>
__x = <optimized out>
wc = <optimized out>
flow = <optimized out>
a = 0x5614a8c05f08
__func__ = "do_xlate_actions"
old_stack = {base = 0x7fb88eff5560, data = 0x7fb88eff5560, size = 0, allocated = 1024, header = 0x0, msg = <optimized out>, list_node = {prev = <optimized out>, next = <optimized out>}, source = <optimized out>}
new_stack = {{u8 = '\000' <repeats 127 times>, be16 = {0 <repeats 64 times>}, be32 = {0 <repeats 32 times>}, be64 = {0 <repeats 16 times>}, be128 = {{be32 = {0, 0, 0, 0}, be64 = {hi = 0, lo = 0}}, {be32 = {0, 0, 0, 0}, be64 = {hi = 0, lo = 0}}, {be32 = {0, 0, 0, 0}, be64 = {hi = 0, lo = 0}}, {be32 = {0, 0, 0, 0}, be64 = {hi = 0, lo = 0}}, {be32 = {0, 0, 0, 0}, be64 = {hi = 0, lo = 0}}, {be32 = {0, 0, 0, 0}, be64 = {hi = 0, lo = 0}}, {be32 = {0, 0, 0, 0}, be64 = {hi = 0, lo = 0}}, {be32 = {0, 0, 0, 0}, be64 = {hi = 0, lo = 0}}},
Reported to the OVS mailing list as well, with the full backtrace attached. Now, any idea what we can do about this as a first shot? We can definitely reproduce it; the bug occurs pretty frequently. From what I see, the thread is blocked for no apparent reason. What is also very strange is the way we mitigate the problem: we simply take a gcore coredump of ovs-vswitchd, which unblocks the stuck thread, and ovs-vswitchd continues to work normally. Even with this workaround it's extremely annoying, especially if you hit it while creating many VMs, etc. Does anyone have a suggestion on what we can do about this now?
Also to note, we tested 4.x kernels and even older versions of OVS; nothing seems to help. So the problem is most probably an interaction between Ubuntu Bionic and some of the C libraries OVS is using. We will need help from somebody familiar with the OVS code to figure out what is going on here; we are ready to patch and test, but so far have no idea where to begin.
@jharbott @ionutbiru anything new regarding this from you?
This issue might be related, and it seems that it is not fixed yet in Ubuntu 18.04: https://sourceware.org/bugzilla/show_bug.cgi?id=23844
@igsilya thanks for the lead; from what I see this could be the most likely cause. I will see about testing the patch in our environment.
@igsilya @zdenekjanda That looks pretty interesting and would match what we see here. I'll also trigger some testing on our side, but it will take a couple of days.
So I did some testing with the patch for the glibc bug applied to the Ubuntu Bionic version of glibc (2.27-3ubuntu1), but it did not resolve the OVS lockups for us. Going to try newer glibc versions now; maybe there were similar issues that got fixed.
@jharbott I have tested the same and it didn't resolve the lockup. But two days ago we deployed 2.30-0ubuntu2, which is the first version to include the patch natively (a brave step: we added the repos from eoan and installed libc-dev-bin libc6-dbg libc6-dev multiarch-support libc-bin libc6 libidn2-0 locales), and since then there have been no occurrences of the bug (before, we had a stable 10+ occurrences a day). So the issue is resolved; indeed it's not just this patch but also something else. I definitely strongly suggest avoiding openvswitch with glibc 2.27-2.29. We will keep this setup a few more days to see whether the bug reappears or whether there are any issues with such a mixed system.
@zdenekjanda that is good news. We started a similar test with 2.29-0ubuntu2 from disco before I read your reply; we'll see on Monday whether that might already be good enough. If not, we will switch to eoan like you did.
So 2.29-0ubuntu2 solves the issue for us, too. With that, I'm closing this issue, as it isn't caused by OVS; I will work with Ubuntu to try to get the fix(es) into Bionic.
Do you have pretty busy network traffic? We seem to also be hit by the same issue in production, but cannot reproduce it in our dev environment. Curious to find the trigger...
https://sourceware.org/bugzilla/show_bug.cgi?id=23861 I think this bug report is pretty interesting....
This is similar to https://github.com/openvswitch/ovs-issues/issues/153, but the backtrace that we are seeing is slightly different, happening on the "else" branch and thus not with multicast traffic. We haven't found a way to trigger this, but have been seeing irregular occurrences since we upgraded our OpenStack setup from Xenial to Bionic. The version of openvswitch is unchanged by this upgrade, though, since on Xenial it was pulled in from the Queens UCA: 2.9.2-0ubuntu0.18.04.3, kernel 4.15.0-55.60.