nigoroll / libvmod-dynamic

The Varnish dns/named director continued
BSD 2-Clause "Simplified" License
95 stars 34 forks source link

Assert error in ref_del(): Condition((r->dir) != NULL) not true #109

Closed delthas closed 7 months ago

delthas commented 7 months ago

I found an assert with a simple minimum working example.

Running against Varnish 7.4.2, against libvmod-dynamic master.

Assert error in ref_del(), vmod_dynamic.c line 427:
  Condition((r->dir) != NULL) not true.
Backtrace:
  ip=0x5555555c669e sp=0x7fffffffba20 <pan_ic+0x34e>
  ip=0x5555556966b5 sp=0x7fffffffbbb0 <VAS_Fail+0x55>
  ip=0x7fffeab0dad7 sp=0x7fffffffbc00 <ref_del+0x1b7>
  ip=0x7fffeab0e120 sp=0x7fffffffbc20 <dom_destroy+0xa0>
  ip=0x5555556023fa sp=0x7fffffffbc50 <vcldir_retire+0x22a>
  ip=0x5555555fd863 sp=0x7fffffffbc70 <vcldir_deref+0x163>
  ip=0x5555555fda2e sp=0x7fffffffbc90 <VRT_Assign_Backend+0x1be>
  ip=0x7fffeaa958ac sp=0x7fffffffbcc0 <vdir_release+0xbc>
  ip=0x7fffeaa98435 sp=0x7fffffffbce0 <vmod_fallback_release+0x135>
  ip=0x5555555fd67c sp=0x7fffffffbd00 <VRT_DelDirector+0x19c>
  ip=0x7fffeaa97a5a sp=0x7fffffffbd30 <vmod_fallback__fini+0xfa>
  ip=0x7fffeaa8c3bf sp=0x7fffffffbd50 <VGC_function_vcl_synth+0x45f>
  ip=0x5555555e15ad sp=0x7fffffffbd70 <vcl_send_event+0x51d>
  ip=0x5555555e082f sp=0x7fffffffbdd0 <VCL_Poll+0x3bf>
  ip=0x5555555e2db3 sp=0x7fffffffbe10 <vcl_cli_discard+0x3d3>
  ip=0x55555569b18c sp=0x7fffffffbe50 <cls_dispatch+0x22c>
  ip=0x55555569ad67 sp=0x7fffffffbe80 <cls_exec+0x3c7>
  ip=0x555555699ad9 sp=0x7fffffffbee0 <cls_feed+0x589>
  ip=0x5555556994bf sp=0x7fffffffbf50 <VCLS_Poll+0x3bf>
  ip=0x555555592704 sp=0x7fffffffdfb0 <CLI_Run+0xf4>
  ip=0x5555555c07ac sp=0x7fffffffdfe0 <child_main+0x42c>
  ip=0x5555556330e9 sp=0x7fffffffe040 <mgt_launch_child+0x6b9>
  ip=0x555555632a01 sp=0x7fffffffe0c0 <MCH_Start_Child+0x11>
  ip=0x55555563aa60 sp=0x7fffffffe0e0 <main+0x1a40>
  ip=0x7ffff7958cd0 sp=0x7fffffffe320 <__libc_init_first+0x90>
  ip=0x7ffff7958d8a sp=0x7fffffffe3c0 <__libc_start_main+0x8a>
  ip=0x55555557ded5 sp=0x7fffffffe410 <_start+0x25>

This happens when discarding a VCL.


To reproduce, here's a failing VTC (timeout, but the process actually crashes)

varnishtest "reload"

shell {
    cat >${tmpdir}/f1 <<-EOF
    vcl 4.1;
    import ${vmod_dynamic};
    import directors;

    backend none none;

    sub vcl_init {
        new dir = directors.fallback();
        new res = dynamic.resolver();
        new dyn = dynamic.director(resolver = res.use());
        dir.add_backend(dyn.backend(host={"example.com"}, port={"80"}) );
    }
    EOF
}

varnish v1 -cliok "vcl.load vcl1 ${tmpdir}/f1"
varnish v1 -cliok "vcl.use vcl1"
varnish v1 -cliok "start"

varnish v1 -cliok "vcl.load vcl2 ${tmpdir}/f1"
varnish v1 -cliok "vcl.use vcl2"
varnish v1 -cliok "vcl.discard vcl1"

varnish v1 -cliok "vcl.load vcl3 ${tmpdir}/f1"
varnish v1 -cliok "vcl.use vcl3"
varnish v1 -cliok "vcl.discard vcl2"

varnish v1 -cliok "vcl.load vcl4 ${tmpdir}/f1"
varnish v1 -cliok "vcl.use vcl4"
varnish v1 -cliok "vcl.discard vcl3"

varnish v1 -cliok "vcl.load vcl5 ${tmpdir}/f1"
varnish v1 -cliok "vcl.use vcl5"
varnish v1 -cliok "vcl.discard vcl4"

You can't see the panic when using the VTC, so I'm instead doing:

crash.vcl:

vcl 4.1;
import dynamic from ".../libvmod-dynamic/src/.libs/libvmod_dynamic.so";
import directors;

backend proforma none;

sub vcl_init {
                new dir = directors.fallback();
                new res = dynamic.resolver();
                new dyn = dynamic.director(resolver = res.use());
                dir.add_backend(dyn.backend(host={"example.com"}, port={"80"}) );
}

reload.sh:

#!/bin/bash
varnishadm vcl.load v1 crash.vcl
varnishadm vcl.use v1
varnishadm vcl.discard boot
for i in $(seq 2 100); do
varnishadm vcl.load v$i crash.vcl
varnishadm vcl.use v$i
varnishadm vcl.discard v$((i-1))
sleep 0.1
done
varnishd -F -j none -p "debug=+vmod_so_keep" -a :12333 -f crash.vcl
reload.sh

And with this I can systematically reproduce the panic and get the trace.

delthas commented 7 months ago

Can also reproduce against Varnish master + libvmod-dynamic master

delthas commented 7 months ago

I investigated further and found the cause: r->dir is NULL because it is not set in the first place.

The precondition is that we are in a cooling state when processing the lookup results (for example, we just loaded and immediately discarded a VCL with a director initialized in its vcl_init).

Then:

Later, when actually discarding the VCL, we go through the refs, and expect to find a ref->dir != NULL, then panic/assert. But the root issue is actually when adding the ref.

I expect that in the case VRT_new_backend returns NULL, we instead skip the ref or something similar.

delthas commented 7 months ago

It looks very similar to https://github.com/nigoroll/libvmod-dynamic/issues/108 (we try to create a director while not warm); but here we are adding while COOLING, whereas in the other ticket I think we are adding while COOL.

nigoroll commented 7 months ago

Thank you fr your work on both tickets. I have seen them and I think the way forward is to get a reference on the vcl while the resolver thread is running. I just need to prioritize other work at the moment.

nigoroll commented 7 months ago

I hope to have fixed this issue and would appreciate feedback.

Note: I believe this can only work with https://github.com/varnishcache/varnish-cache/pull/4037 / https://github.com/varnishcache/varnish-cache/commit/c44bd67f25216746311b3f65ba39552a8971934b in place because we rely on the vcl reference from VRT_VCL_Prevent_Discard() to ensure that we are in VCL_COOLING and not VCL_COLD. Before the aforementioned commit, this order was wrong.