nigoroll / libvmod-dynamic

The Varnish dns/named director continued
BSD 2-Clause "Simplified" License

infinite loop in dynamic_resolve #7

Closed felipewd closed 6 years ago

felipewd commented 7 years ago

Hi,

We're facing an infinite loop in dynamic_resolve:

    do {
            if (next != NULL)
                    next = VTAILQ_NEXT(next, list);
            if (next == NULL)
                    next = VTAILQ_FIRST(&dom->refs);
    } while (next != dom->current &&
        !next->be->dir->healthy(next->be->dir, NULL, NULL));

One possible cause is dom->current being NULL in the first place.

Only one other function changes the value of dom->current and could make this condition possible: dynamic_del:

    if (r == dom->current)
            dom->current = VTAILQ_NEXT(r, list);

This part of the code lacks the protection of falling back to VTAILQ_FIRST when VTAILQ_NEXT returns NULL.

We'll submit a patch to fix this.

dridi commented 7 years ago

The patch in #8 is very simple and I'd like you to roll with it for a while to make sure it actually fixes the infinite loop. The fix is a bit surprising because the backend resolution is guarded by the domain's lock that is needed when the backend is deleted...

...except at discard time, when by definition there is no ongoing HTTP transaction, so no resolution.

Thanks again for contributing so much to this project!

felipewd commented 7 years ago

Sure thing, will run this for a few days to make sure this is fixed and report back.

dridi commented 7 years ago

It's been a week, have you experienced this problem again? I guess not.

There's a branch [1] I'd like you to test; it's a different fix for the deadlock you reported. You can apply the patch #8 on top of it, of course, but I'd like real-life feedback on how it performs. Basically I reviewed the locking and found that the origin of the deadlock was an unneeded lock. There's no need to bail out like in your original patch #6.

[1] https://github.com/nigoroll/libvmod-dynamic/tree/fix/deadlock

felipewd commented 7 years ago

We indeed are quite happy with Varnish+dynamic now.

We had to apply one other patch, though:

https://github.com/nigoroll/libvmod-dynamic/pull/8/commits/5c1ec6ef958dedf0b58f9dec4ec84f9db29b3ce0

This fixed both issues: deleting a backend, and looking up a healthy one with dom->current == NULL.

About your patch, I can test it, sure. But your patch seems to be about the deadlock, not the infinite loop, right? These are separate issues for us.

dridi commented 7 years ago

Yes, it's a different fix for the deadlock you reported. I'd like you to give it a try with your infinite loop patch on top of it.

felipewd commented 7 years ago

I'm trying this branch (with our loop patch) this afternoon. Will report back in a few days.

felipewd commented 7 years ago

Hi Dridi. Your fix/deadlock branch (plus our #8 commit) is stable in production. There seems to be yet another infinite loop (hard to trigger, though, and not related to changing VCLs), but like @rnsanchez pointed out in his commit, we're trying other possible ideas to catch that.

dridi commented 7 years ago

I was finally able to reproduce this problem, and although your fix in #8 strives to never let the current field be NULL, you can't actually prevent it in one case: when no backends are resolved.

To reproduce it I needed to delete the last backend from a domain while all other backends from the domain were sick. I can't write a test case until I have a reliable way to automate it. It's easy to make backends sick using the CLI for instance, but the rest is a lot less trivial.

I'm reopening this ticket in the hope that you will run the current master branch and report that it's stable!

rnsanchez commented 7 years ago

We will most likely roll it out in the next update window.

Thank you for considering the PR! :-)

rnsanchez commented 7 years ago

OK, after some days we have been hitting a slightly different scenario: things go fine pretty much all the time until, suddenly, at least one thread gets stuck (in sys-time, as opposed to user-time as before) for just about 10 seconds (it is actually less, but that's the best resolution we can get right now), and Varnish will SIGQUIT it.

Varnish is as current as ab7c539ac40f12091adfb2fa00876dbeb0feff80 (post-4.1.4).

We trapped this SIGQUIT and forced a backtrace:

Last panic at: Wed, 21 Dec 2016 21:28:45 GMT
"Assert error in child_sigquit_handler(), mgt/mgt_child.c line 307:
  Condition(SIGQUIT at 0x93f1) not true.
thread = (cache-main)
version = varnish-4.1.4 revision a4cc30f
now = 14619335.301214 (mono), 1482355724.864580 (real)
Backtrace:
  0x43e6d0: pan_backtrace+0x1d
  0x43eb27: pan_ic+0x2f3
  0x4690c6: mgt_launch_child
  0x7f16c663c3b0: libpthread.so.0(+0x113b0) [0x7f16c663c3b0]
  0x7f16c663b28c: libpthread.so.0(+0x1028c) [0x7f16c663b28c]
  0x7f16c6634d35: libpthread.so.0(pthread_mutex_lock+0x75) [0x7f16c6634d35]
  0x437fa8: Lck__Lock+0xae
  0x7a07ad9fa46d: libvmod_dynamic.so(vmod_event+0x23d) [0x7a07ad9fa46d]
  0x7ccc91d996e9: vgc.so(+0x2a6e9) [0x7ccc91d996e9]
  0x44d354: vcl_failsafe_event+0x223
"

Source-disassembly intermix:

libvmod-dynamic/src/vmod_dynamic.c:533
                AZ(pthread_join(dom->thread, NULL));
    3429:       81 7d 00 d1 7f 3e 8a    cmpl   $0x8a3e7fd1,0x0(%rbp)
    3430:       0f 85 5e 06 00 00       jne    3a94 <vmod_event+0x864>
libvmod-dynamic/src/vmod_dynamic.c:542
                VTAILQ_REMOVE(&dom->obj->purged_domains, dom, list);
        }

        INIT_OBJ(&ctx, VRT_CTX_MAGIC);
        ctx.vcl = obj->vcl;
        VRT_rel_vcl(&ctx, &obj->vclref);
    3436:       4c 8b 7d 78             mov    0x78(%rbp),%r15
    343a:       4d 85 ff                test   %r15,%r15
    343d:       0f 84 c1 00 00 00       je     3504 <vmod_event+0x2d4>
    3443:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
libvmod-dynamic/src/vmod_dynamic.c:543
        Lck_Unlock(&obj->mtx);
    3448:       41 81 3f 45 13 fe 1b    cmpl   $0x1bfe1345,(%r15)
    344f:       0f 85 b9 02 00 00       jne    370e <vmod_event+0x4de>
libvmod-dynamic/src/vmod_dynamic.c:544
}
    3455:       49 8d 5f 18             lea    0x18(%r15),%rbx
    3459:       48 8d 35 b8 22 00 00    lea    0x22b8(%rip),%rsi       # 5718 <__func__.8678>
    3460:       ba 20 02 00 00          mov    $0x220,%edx
    3465:       48 89 df                mov    %rbx,%rdi
    3468:       e8 53 e1 ff ff          callq  15c0 <Lck__Lock@plt>
libvmod-dynamic/src/vmod_dynamic.c:545

    346d:       49 83 7f 10 00          cmpq   $0x0,0x10(%r15)   <===== return address in the backtrace
    3472:       0f 84 bf 02 00 00       je     3737 <vmod_event+0x507>
libvmod-dynamic/src/vmod_dynamic.c:546
static void
    3478:       49 8d 7f 20             lea    0x20(%r15),%rdi
    347c:       e8 ef e0 ff ff          callq  1570 <pthread_cond_signal@plt>
    3481:       85 c0                   test   %eax,%eax
    3483:       0f 85 d7 02 00 00       jne    3760 <vmod_event+0x530>
libvmod-dynamic/src/vmod_dynamic.c:547
dynamic_start(struct vmod_dynamic_director *obj)
    3489:       48 8d 35 88 22 00 00    lea    0x2288(%rip),%rsi       # 5718 <__func__.8678>
    3490:       ba 23 02 00 00          mov    $0x223,%edx
    3495:       48 89 df                mov    %rbx,%rdi
    3498:       e8 43 e1 ff ff          callq  15e0 <Lck__Unlock@plt>
libvmod-dynamic/src/vmod_dynamic.c:542
                VTAILQ_REMOVE(&dom->obj->purged_domains, dom, list);
        }

        INIT_OBJ(&ctx, VRT_CTX_MAGIC);
        ctx.vcl = obj->vcl;
        VRT_rel_vcl(&ctx, &obj->vclref);
    349d:       4d 8b bf 88 00 00 00    mov    0x88(%r15),%r15
    34a4:       4d 85 ff                test   %r15,%r15
    34a7:       75 9f                   jne    3448 <vmod_event+0x218>

Hope this helps. Let us know if you need more information.

dridi commented 7 years ago

Thanks for the detailed report. I'm a bit confused regarding how I should read this. I see vmod_event in the backtrace so this is the CLI thread, although it may not be related to the CLI but simply because this thread happens to catch the SIGQUIT event sent by the manager process.

At first glance it looks like a deadlock, but the backtrace can't be right: it says that vmod_event locks, but that's not the case. It calls functions that lock for WARM and COLD events, but those should appear in the stack (unless they got inlined for some reason, though they don't look like good candidates).

The assembly listing is confusing, line numbers don't seem to match. Are you running master or do you still have patches on top of it?

rnsanchez commented 7 years ago

We tried for a while our branch, but for the most part we are running master.

The backtraces will only give us snippets of what might be going on (like a sampling profiler). This is the usual backtrace (happened a couple hours ago, and by far the most frequent we see):

Backtrace:
  0x43e6d0: pan_backtrace+0x1d
  0x43eb27: pan_ic+0x2f3
  0x4690c6: mgt_launch_child
  0x7f124bbc23b0: libpthread.so.0(+0x113b0) [0x7f124bbc23b0]
  0x414649: VBE_Poll+0x3a
  0x41cfbf: cli_cb_before+0x71
  0x7f124d1ef0b4: libvarnish.so(+0x80b4) [0x7f124d1ef0b4]
  0x7f124d1ef690: libvarnish.so(+0x8690) [0x7f124d1ef690]
  0x7f124d1f6682: libvarnish.so(+0xf682) [0x7f124d1f6682]
  0x7f124d1f6849: libvarnish.so(VLU_Fd+0xe3) [0x7f124d1f6849]

Note that the vmod does not appear in this one; the one with the vmod is the lucky one. I can't say for sure whether the backtrace featuring the vmod was with master or the branch, though. To me it looks like it was master (just checked again). The code is compiled with optimizations, thus part of the surprising intermix of code (inlines expanded, code hoisted, etc). The inlining of static/single-call (leaf-call optimization, maybe) functions is common in GCC, so this does not surprise me.

Apologies if this is already known, but the core of the trace is pointing to return addresses as if the function were to resume running.

We seem to have 2 scenarios still pointing to the same problem: 2 minutes of user-time (the backtrace above), and the other one with just shy of 10 seconds of sys-time. Both of these lead the parent process to SIGQUIT the child. It still feels like a deadlock or lock-order reversal in some odd way.

Since this runs in production, the tooling we can use is somewhat limited. If you have a test case (varnishtest) that can shake these locks, we can try to reproduce the problem or at least help out in investigating this issue.

dridi commented 7 years ago

The problem with varnishtest is that it falls short when it comes down to concurrency problems. We have means of synchronization with worker threads when they are running in the VCL engine but that's about it. We also have means to avoid concurrency, but nothing for fine-grained stressing since we test Varnish outside-in.

The best we can probably do is to stress a varnish instance with a short-TTL director, and make frequent VCL and DNS changes.

rnsanchez commented 7 years ago

OK, so basically our production scenario. :-)

dridi commented 7 years ago

I hope not, I have a subsecond TTL in mind, and changes happening every other second or so.

nigoroll commented 7 years ago

@felipewd @rnsanchez Dridi cannot continue working on this, can you please let me know if this issue still needs attention?

rnsanchez commented 7 years ago

AFAIK, no.

It has been rolled out of production for the time being.