The patch in #8 is very simple and I'd like you to roll with it for a while to make sure it actually fixes the infinite loop. The fix is a bit surprising because backend resolution is guarded by the domain's lock, which is also needed when a backend is deleted...
...except at discard time, when by definition there is no ongoing HTTP transaction, so no resolution.
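To make the reasoning concrete, here is a minimal stand-alone sketch of the idea (toy names, not the vmod's actual code): resolution and deletion share the per-domain lock, while discard can run without it because no resolution can be in flight.

#include <pthread.h>
#include <stddef.h>

/*
 * Toy model of the argument above, not the vmod's actual code:
 * resolution and deletion both run under the per-domain lock, while
 * discard only happens once the VCL is cold, so no resolution can be
 * in flight and the lock is not needed there.
 */
struct toy_domain {
	pthread_mutex_t  mtx;      /* stands in for the domain's lock */
	void            *current;  /* stands in for dom->current */
};

/* per-transaction resolution: always under the lock */
static void *
toy_resolve(struct toy_domain *dom)
{
	void *backend;

	pthread_mutex_lock(&dom->mtx);
	backend = dom->current;
	pthread_mutex_unlock(&dom->mtx);
	return (backend);
}

/* backend deletion: races with resolution, hence the same lock */
static void
toy_delete(struct toy_domain *dom)
{
	pthread_mutex_lock(&dom->mtx);
	dom->current = NULL;
	pthread_mutex_unlock(&dom->mtx);
}

/* discard: the VCL is cold, toy_resolve() cannot run, no lock needed */
static void
toy_discard(struct toy_domain *dom)
{
	dom->current = NULL;
}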
Thanks again for contributing so much to this project!
Sure thing, will run this for a few days to make sure this is fixed and report back.
It's been a week, have you experienced this problem again? I guess not.
There's a branch [1] I'd like you to test; it's a different fix for the deadlock you reported. You can apply patch #8 on top of it, of course, but I'd like real-life feedback on how it performs. Basically, I reviewed the locking and found that the deadlock actually originated from an unneeded lock. There's no need to bail out like in your original patch #6.
[1] https://github.com/nigoroll/libvmod-dynamic/tree/fix/deadlock
We indeed are quite happy with Varnish+dynamic now.
We had to apply one other patch, though:
https://github.com/nigoroll/libvmod-dynamic/pull/8/commits/5c1ec6ef958dedf0b58f9dec4ec84f9db29b3ce0
This fixed both issues: deleting a backend, and looking up a healthy one when dom->current is NULL.
About your patch, I can test it, sure. But your patch seems to be about the deadlock, not the infinite loop, right? These are separate issues for us.
Yes, it's a different fix for the deadlock you reported. I'd like you to give it a try with your infinite loop patch on top of it.
I'm trying this branch (with our loop patch) this afternoon. Will report back in a few days.
Hi Dridi. Your fix/deadlock branch (plus our #8 commit) is stable in production. There seems to be yet another infinite loop (hard to trigger, though, and not related to changing VCLs), but as @rnsanchez pointed out in his commit, we're trying other ideas to catch it.
I was finally able to reproduce this problem, and although your fix in #8 strives to never let the current field be NULL, you can't actually prevent it in one case: when no backends are resolved.
To reproduce it I needed to delete the last backend from a domain while all other backends from the domain were sick. I can't write a test case until I have a reliable way to automate it. It's easy to make backends sick using the CLI for instance, but the rest is a lot less trivial.
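For illustration only, here is a stand-alone sketch (made-up names, not the actual change in master) of the guard that case calls for: when a domain ends up with zero resolved backends there is nothing for current to point to, so the lookup has to report "no backend" instead of walking an empty list.

#include <stddef.h>

/* Hedged sketch with made-up names (sketch_domain, n_backends), not
 * the vmod's real code: with zero resolved backends, current cannot
 * point anywhere, so return NULL and let the caller fail the fetch
 * instead of spinning. */
struct sketch_domain {
	size_t  n_backends;   /* number of currently resolved backends */
	void   *current;      /* stands in for dom->current */
};

static void *
sketch_lookup(struct sketch_domain *dom)
{
	if (dom->n_backends == 0) {
		dom->current = NULL;
		return (NULL);
	}
	return (dom->current);
}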
I'm reopening this ticket in the hope that you will run the current master branch and report it's stable!
We will most likely roll it out in the next update window.
Thank you for considering the PR! :-)
OK, after some days we have been hitting a slightly different scenario: things will go fine pretty much all the time until, suddenly, at least one thread gets stuck (in sys-time, as opposed to user-time as before) for just about 10 seconds (it is actually less, but that's the best we can measure right now), and Varnish will SIGQUIT it.
Varnish is as current as ab7c539ac40f12091adfb2fa00876dbeb0feff80 (post-4.1.4).
We trapped this SIGQUIT and forced a backtrace:
Last panic at: Wed, 21 Dec 2016 21:28:45 GMT
"Assert error in child_sigquit_handler(), mgt/mgt_child.c line 307:
Condition(SIGQUIT at 0x93f1) not true.
thread = (cache-main)
version = varnish-4.1.4 revision a4cc30f
now = 14619335.301214 (mono), 1482355724.864580 (real)
Backtrace:
0x43e6d0: pan_backtrace+0x1d
0x43eb27: pan_ic+0x2f3
0x4690c6: mgt_launch_child
0x7f16c663c3b0: libpthread.so.0(+0x113b0) [0x7f16c663c3b0]
0x7f16c663b28c: libpthread.so.0(+0x1028c) [0x7f16c663b28c]
0x7f16c6634d35: libpthread.so.0(pthread_mutex_lock+0x75) [0x7f16c6634d35]
0x437fa8: Lck__Lock+0xae
0x7a07ad9fa46d: libvmod_dynamic.so(vmod_event+0x23d) [0x7a07ad9fa46d]
0x7ccc91d996e9: vgc.so(+0x2a6e9) [0x7ccc91d996e9]
0x44d354: vcl_failsafe_event+0x223
"
Source-disassembly intermix:
libvmod-dynamic/src/vmod_dynamic.c:533
AZ(pthread_join(dom->thread, NULL));
3429: 81 7d 00 d1 7f 3e 8a cmpl $0x8a3e7fd1,0x0(%rbp)
3430: 0f 85 5e 06 00 00 jne 3a94 <vmod_event+0x864>
libvmod-dynamic/src/vmod_dynamic.c:542
VTAILQ_REMOVE(&dom->obj->purged_domains, dom, list);
}
INIT_OBJ(&ctx, VRT_CTX_MAGIC);
ctx.vcl = obj->vcl;
VRT_rel_vcl(&ctx, &obj->vclref);
3436: 4c 8b 7d 78 mov 0x78(%rbp),%r15
343a: 4d 85 ff test %r15,%r15
343d: 0f 84 c1 00 00 00 je 3504 <vmod_event+0x2d4>
3443: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
libvmod-dynamic/src/vmod_dynamic.c:543
Lck_Unlock(&obj->mtx);
3448: 41 81 3f 45 13 fe 1b cmpl $0x1bfe1345,(%r15)
344f: 0f 85 b9 02 00 00 jne 370e <vmod_event+0x4de>
libvmod-dynamic/src/vmod_dynamic.c:544
}
3455: 49 8d 5f 18 lea 0x18(%r15),%rbx
3459: 48 8d 35 b8 22 00 00 lea 0x22b8(%rip),%rsi # 5718 <__func__.8678>
3460: ba 20 02 00 00 mov $0x220,%edx
3465: 48 89 df mov %rbx,%rdi
3468: e8 53 e1 ff ff callq 15c0 <Lck__Lock@plt>
libvmod-dynamic/src/vmod_dynamic.c:545
346d: 49 83 7f 10 00 cmpq $0x0,0x10(%r15) <===== return address in the backtrace
3472: 0f 84 bf 02 00 00 je 3737 <vmod_event+0x507>
libvmod-dynamic/src/vmod_dynamic.c:546
static void
3478: 49 8d 7f 20 lea 0x20(%r15),%rdi
347c: e8 ef e0 ff ff callq 1570 <pthread_cond_signal@plt>
3481: 85 c0 test %eax,%eax
3483: 0f 85 d7 02 00 00 jne 3760 <vmod_event+0x530>
libvmod-dynamic/src/vmod_dynamic.c:547
dynamic_start(struct vmod_dynamic_director *obj)
3489: 48 8d 35 88 22 00 00 lea 0x2288(%rip),%rsi # 5718 <__func__.8678>
3490: ba 23 02 00 00 mov $0x223,%edx
3495: 48 89 df mov %rbx,%rdi
3498: e8 43 e1 ff ff callq 15e0 <Lck__Unlock@plt>
libvmod-dynamic/src/vmod_dynamic.c:542
VTAILQ_REMOVE(&dom->obj->purged_domains, dom, list);
}
INIT_OBJ(&ctx, VRT_CTX_MAGIC);
ctx.vcl = obj->vcl;
VRT_rel_vcl(&ctx, &obj->vclref);
349d: 4d 8b bf 88 00 00 00 mov 0x88(%r15),%r15
34a4: 4d 85 ff test %r15,%r15
34a7: 75 9f jne 3448 <vmod_event+0x218>
Hope this helps. Let us know if you need more information.
Thanks for the detailed report. I'm a bit confused regarding how I should read this. I see vmod_event in the backtrace, so this is the CLI thread, although it may not be related to the CLI but simply because this thread happens to catch the SIGQUIT event sent by the manager process.
At first glance it looks like a deadlock, but the backtrace can't be right: it says that vmod_event locks, but that's not the case. It will call functions that lock for WARM and COLD events, but those should appear in the stack (unless they got inlined for some reason, but they don't look like good candidates).
The assembly listing is confusing, line numbers don't seem to match. Are you running master or do you still have patches on top of it?
We tried our branch for a while, but for the most part we are running master.
The backtraces will only give us snippets of what might be going on (like a sampling profiler). This is the usual backtrace (it happened a couple of hours ago, and is by far the most frequent one we see):
Backtrace:
0x43e6d0: pan_backtrace+0x1d
0x43eb27: pan_ic+0x2f3
0x4690c6: mgt_launch_child
0x7f124bbc23b0: libpthread.so.0(+0x113b0) [0x7f124bbc23b0]
0x414649: VBE_Poll+0x3a
0x41cfbf: cli_cb_before+0x71
0x7f124d1ef0b4: libvarnish.so(+0x80b4) [0x7f124d1ef0b4]
0x7f124d1ef690: libvarnish.so(+0x8690) [0x7f124d1ef690]
0x7f124d1f6682: libvarnish.so(+0xf682) [0x7f124d1f6682]
0x7f124d1f6849: libvarnish.so(VLU_Fd+0xe3) [0x7f124d1f6849]
Note that the vmod does not appear in this one; the one with the vmod is the lucky one. I can't say for sure whether the backtrace featuring the vmod was with master or the branch, though. To me it looks like it was master (just checked again). The code is compiled with optimizations, which explains part of the surprising intermix of code (inlines expanded, code hoisted, etc.). The inlining of static/single-call functions (leaf-call optimization, maybe) is common in GCC, so this does not surprise me.
Apologies if this is already known, but the core of the trace is pointing to return addresses as if the function were to resume running.
We seem to have two scenarios still pointing to the same problem: two minutes of user-time (the backtrace above), and another one with just shy of 10 seconds of sys-time. Both of these lead the parent process to SIGQUIT the child. It still feels like a deadlock or a lock-order reversal in some odd way.
Since this runs in production, the tooling we can use is somewhat limited. If you have a test case (varnishtest) that can shake these locks, we can try to reproduce the problem or at least help out in investigating this issue.
The problem with varnishtest is that it falls short when it comes down to concurrency problems. We have means of synchronization with worker threads when they are running in the VCL engine, but that's about it. We also have means to avoid concurrency, but nothing for fine-grained stressing since we test Varnish outside-in.
The best we can probably do is to stress a varnish instance with a short-TTL director, and make frequent VCL and DNS changes.
OK, so basically our production scenario. :-)
I hope not, I have a subsecond TTL in mind, and changes happening every other second or so.
@felipewd @rnsanchez Dridi cannot continue working on this, can you please let me know if this issue still needs attention?
AFAIK, no.
It has been rolled out of production for the time being.
Hi,
We're facing an infinite loop in dynamic_resolve: one possible cause for this is dom->current being NULL in the first place.
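The real code is in libvmod-dynamic/src/vmod_dynamic.c; what follows is only a self-contained toy model of the loop's shape (standard TAILQ macros, which the vmod's VTAILQ macros mirror) to show how a NULL current makes it spin when all backends are sick:

#include <stddef.h>
#include <sys/queue.h>

struct loop_backend {
	TAILQ_ENTRY(loop_backend) list;
	int healthy;
};

TAILQ_HEAD(loop_backend_head, loop_backend);

struct loop_dom {
	struct loop_backend_head backends;
	struct loop_backend *current;   /* stands in for dom->current */
};

/*
 * Toy model of the resolve loop: start from current, look for a
 * healthy backend, wrap around to the head of the list, and stop once
 * we are back where we started.  With current == NULL and only sick
 * backends, "back where we started" is never reached: infinite loop.
 */
static struct loop_backend *
loop_resolve(struct loop_dom *dom)
{
	struct loop_backend *b = dom->current;

	do {
		if (b != NULL && b->healthy)
			return (b);
		b = (b != NULL) ? TAILQ_NEXT(b, list) : NULL;
		if (b == NULL)
			b = TAILQ_FIRST(&dom->backends);
	} while (b != dom->current);
	return (NULL);
}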
Only one other function changes the value of dom->current and could make this condition possible: dynamic_del. That part of the code doesn't have the protection of going back to VTAILQ_FIRST when VTAILQ_NEXT returns NULL, so dom->current can be left NULL.
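Again just a toy sketch reusing the names from the model above, not the vmod's actual source, to show the shape of the problem: the deletion path advances current but has no fallback when there is no next element.

/*
 * Toy model of the deletion path: when the backend being removed is
 * the current one, current is moved to the next element.  If the
 * removed backend was the last one in the list, TAILQ_NEXT() is NULL
 * and current is left NULL, which is exactly what trips the resolve
 * loop above.
 */
static void
loop_del(struct loop_dom *dom, struct loop_backend *b)
{
	if (dom->current == b)
		dom->current = TAILQ_NEXT(b, list);   /* missing: fall back to TAILQ_FIRST */
	TAILQ_REMOVE(&dom->backends, b, list);
}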
We'll submit a patch to fix this.