nigoroll / libvmod-dynamic

The Varnish dns/named director continued
BSD 2-Clause "Simplified" License

Assert error in dom_release(): Condition((dom->thread) == 0) not true #107

Closed: delthas closed this issue 8 months ago

delthas commented 8 months ago

Hi,

(Following our discussion on IRC, I'm opening an issue.)

I got a panic in libvmod-dynamic after reloading the VCL (varnishreload, i.e. loading a new VCL, using it, and discarding an older one), running https://github.com/nigoroll/libvmod-dynamic/commit/2d51da74680e849c8411bcd6d85cd53c149b3052 against Varnish 7.3.0.

The panic appears to be triggered while handling a VCL_EVENT_COLD event, when a dynamic director is released.

Panic annotated with some locations:

Assert error in dom_release(), vmod_dynamic.c line 952:
  Condition((dom->thread) == 0) not true.
Backtrace:
  0x5609d924f872: /usr/sbin/varnishd(+0x5b872) [0x5609d924f872] -> pan_ic
  0x5609d92cc745: /usr/sbin/varnishd(VAS_Fail+0x45) [0x5609d92cc745]
  0x7564a4065fe2: ./vmod_cache/_vmod_dynamic.29b766e616262ab0ef9252419367ce48fbf642168ba34fc382b1742a77dac5c4(+0x4fe2) [0x7564a4065fe2] -> (dom_destroy ?) -> dom_release
  0x5609d927186a: /usr/sbin/varnishd(+0x7d86a) [0x5609d927186a] -> vcldir_deref
  0x7564a40662fa: ./vmod_cache/_vmod_dynamic.29b766e616262ab0ef9252419367ce48fbf642168ba34fc382b1742a77dac5c4(+0x52fa) [0x7564a40662fa] (? guessed from the code) dynamic_stop -> dynamic_gc_expired -> dom_free -> VRT_DelDirector
  0x7564a4069156: ./vmod_cache/_vmod_dynamic.29b766e616262ab0ef9252419367ce48fbf642168ba34fc382b1742a77dac5c4(vmod_event+0x176) [0x7564a4069156] -> vmod_event (VCL_EVENT_COLD)
  0x754e303bc5b5: vcl_reload_20231121_143229_2498323.1700577149.771996/vgc.so(+0x305b5) [0x754e303bc5b5]
  0x5609d925fd71: /usr/sbin/varnishd(+0x6bd71) [0x5609d925fd71] -> vcl_send_event (VCL_EVENT_COLD)
  0x5609d926027d: /usr/sbin/varnishd(+0x6c27d) [0x5609d926027d] -> vcl_set_state (0)
  0x5609d9261fb8: /usr/sbin/varnishd(VCL_Poll+0xe8) [0x5609d9261fb8] -> VCL_Poll

According to our discussion on IRC, this could have been solved by https://github.com/nigoroll/libvmod-dynamic/commit/b54c48849a0ffb940f87d641c0b13584cb21540e, but I'm running a version that already includes that commit.
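
For context, the failing check amounts to something like the following. This is a minimal sketch reconstructed from the panic message alone; the names and struct layout are placeholders, not the vmod's actual source. dom_release() expects the domain's resolver thread to no longer be recorded by the time the director is released.

/*
 * Minimal sketch based only on the panic message above; placeholder
 * names, not the vmod's actual source.
 */
#include <assert.h>
#include <pthread.h>

struct dyn_domain {
	pthread_t thread;	/* resolver thread; 0 once stopped and joined */
};

void
dom_release_sketch(struct dyn_domain *dom)
{
	/* "Condition((dom->thread) == 0) not true": the resolver thread is
	 * still recorded on the domain when the director gets released. */
	assert(dom->thread == 0);
}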

delthas commented 8 months ago

Here's my analysis:

Precondition: a dynamic domain is in unref_domains (expired but not yet gc'd):

The stack trace is:

So the issue is that when VRT_DelDirector is called while a VCL cold event is being handled, the VCL is already marked as not warm, so Varnish does not send a COLD event to the director and instead calls its destroy callback directly.
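
To make that ordering concrete, here is a small self-contained model of the sequence just described. All model_* names are invented for illustration and the warm check is a simplification; this is not the Varnish-Cache or vmod source.

/*
 * Self-contained model of the ordering described above; invented names,
 * not the Varnish-Cache or vmod source.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct model_dom {
	bool thread_running;	/* stands in for dom->thread != 0 */
};

/* director-level COLD event: this is where the resolver thread gets stopped */
static void
model_dom_event_cold(struct model_dom *dom)
{
	dom->thread_running = false;
}

/* destroy callback: ends in the dom_release() check from the panic */
static void
model_dom_destroy(struct model_dom *dom)
{
	assert(!dom->thread_running);	/* (dom->thread) == 0 */
}

/* director deletion, reduced to the point at issue: only a warm VCL sends
 * the director its own COLD event before the destroy callback runs */
static void
model_del_director(struct model_dom *dom, bool vcl_is_warm)
{
	if (vcl_is_warm)
		model_dom_event_cold(dom);
	model_dom_destroy(dom);
}

int
main(void)
{
	struct model_dom a = { .thread_running = true };
	struct model_dom b = { .thread_running = true };

	/* director deleted while the VCL is still warm: fine */
	model_del_director(&a, true);

	/* director deleted from inside the VCL's own COLD handling: the VCL
	 * is no longer warm, the COLD event is skipped, the assert fires */
	model_del_director(&b, false);

	printf("not reached\n");
	return (0);
}

Run as-is, the first delete passes and the second aborts on the assert, mirroring the panic above.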


I'm able to consistently reproduce the panic by calling .backend() once, waiting for the domain to time out (without ever resolving the director), then reloading/discarding the VCL.

delthas commented 8 months ago

I was able to reproduce this on the latest master as well.

nigoroll commented 8 months ago

Thank you for your excellent bug report, @delthas! It enabled me to write a reproducer test case in less than seven minutes. I'll see what I can do.

nigoroll commented 8 months ago

Stack trace with symbols from the reproducer:

(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007f326bca9d9f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007f326bc5af32 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007f326bc45472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x000055b5cc348ecf in pan_ic (func=0x7f32694ac150 <__func__.25> "dom_release", file=0x7f32694ab026 "vmod_dynamic.c", line=954, cond=0x7f32694ab3a3 "(dom->thread) == 0", kind=VAS_ASSERT) at cache/cache_panic.c:873
#5  0x000055b5cc40f875 in VAS_Fail (func=0x7f32694ac150 <__func__.25> "dom_release", file=0x7f32694ab026 "vmod_dynamic.c", line=954, cond=0x7f32694ab3a3 "(dom->thread) == 0", kind=VAS_ASSERT) at vas.c:67
#6  0x00007f32694a0ec8 in dom_release (dir=<optimized out>) at vmod_dynamic.c:954
#7  dom_destroy (dir=0x7f325ec0a000) at vmod_dynamic.c:979
#8  0x000055b5cc3828ba in vcldir_retire (vdir=0x7f325ec07000) at cache/cache_vrt_vcl.c:260
#9  0x000055b5cc37e0d3 in vcldir_deref (vdir=0x7f325ec07000) at cache/cache_vrt_vcl.c:278
#10 0x000055b5cc37df60 in VRT_DelDirector (dirp=0x7f325ec050e8) at cache/cache_vrt_vcl.c:300
#11 0x00007f32694a107a in dom_free (domp=domp@entry=0x7ffca6e25d28, why=why@entry=0x7f32694ab416 "expired") at vmod_dynamic.c:863
#12 0x00007f32694a1154 in dynamic_gc_expired (obj=obj@entry=0x7f326b501000) at vmod_dynamic.c:878
#13 0x00007f32694a3eb9 in dynamic_stop (obj=0x7f326b501000) at vmod_dynamic.c:893
#14 vmod_event (ctx=0x55b5cc4b7748 <ctx_cli>, priv=<optimized out>, e=<optimized out>) at vmod_dynamic.c:1159
#15 0x00007f32694bb15e in VGC_Cooldown (ctx=0x55b5cc4b7748 <ctx_cli>, ev=VCL_EVENT_COLD) at vgc.c:3999
#16 VGC_Event (ctx=0x55b5cc4b7748 <ctx_cli>, ev=<optimized out>) at vgc.c:4013
#17 0x000055b5cc361e18 in vcl_send_event (vcl=0x7f326b4451e0, ev=VCL_EVENT_COLD, msg=0x7ffca6e25ed8) at cache/cache_vcl.c:255
#18 0x000055b5cc3616f1 in vcl_set_state (vcl=0x7f326b4451e0, state=0x7f326b428390 "0cold", msg=0x7ffca6e25ed8) at cache/cache_vcl.c:597
#19 0x000055b5cc362f85 in vcl_cli_state (cli=0x7f326b4d29d0, av=0x7f326b42c580, priv=0x0) at cache/cache_vcl.c:861
#20 0x000055b5cc41406c in cls_dispatch (cli=0x7f326b4d29d0, cp=0x55b5cc4b2180 <vcl_cmds+128>, av=0x7f326b42c580, ac=3) at vcli_serve.c:255
#21 0x000055b5cc413c47 in cls_exec (cfd=0x7f326b4d29a0, av=0x7f326b42c580) at vcli_serve.c:320
#22 0x000055b5cc412a20 in cls_feed (cfd=0x7f326b4d29a0, p=0x7ffca6e26064 "\n2.1704482820.498056/vgc.so 1auto\no\nOVE.isra.0", e=0x7ffca6e26065 "2.1704482820.498056/vgc.so 1auto\no\nOVE.isra.0") at vcli_serve.c:416
#23 0x000055b5cc4124ea in VCLS_Poll (cs=0x7f326b422900, cli=0x7f326b4d29d0, timeout=-1) at vcli_serve.c:621
#24 0x000055b5cc318c14 in CLI_Run () at cache/cache_cli.c:110
#25 0x000055b5cc343105 in child_main (sigmagic=1, altstksz=81920) at cache/cache_main.c:467
#26 0x000055b5cc3b0ae4 in mgt_launch_child (cli=0x7f326b4d23b0) at mgt/mgt_child.c:408
#27 0x000055b5cc3b1a7e in mch_cli_server_start (cli=0x7f326b4d23b0, av=0x7f326b42c400, priv=0x0) at mgt/mgt_child.c:759
#28 0x000055b5cc41406c in cls_dispatch (cli=0x7f326b4d23b0, cp=0x55b5cc4b26b0 <cli_mch+64>, av=0x7f326b42c400, ac=1) at vcli_serve.c:255
#29 0x000055b5cc413c47 in cls_exec (cfd=0x7f326b4d2380, av=0x7f326b42c400) at vcli_serve.c:320
#30 0x000055b5cc412a20 in cls_feed (cfd=0x7f326b4d2380, 
    p=0x7ffca6e282f0 "\ntartse vcl1\nl1 << %XJEIFLH|)Xspa8P\nvcl 4.1;\nbackend s1 { .host = \"127.0.0.1\"; .port = \"46825\"; }\n\n\n\timport dynamic from \"/home/slink/Devel/varnish-git/libvmod-dynamic/src/.libs/libvmod_dynamic.so\";\n\n"..., 
    e=0x7ffca6e282f1 "tartse vcl1\nl1 << %XJEIFLH|)Xspa8P\nvcl 4.1;\nbackend s1 { .host = \"127.0.0.1\"; .port = \"46825\"; }\n\n\n\timport dynamic from \"/home/slink/Devel/varnish-git/libvmod-dynamic/src/.libs/libvmod_dynamic.so\";\n\n\t"...) at vcli_serve.c:416
#31 0x000055b5cc4124ea in VCLS_Poll (cs=0x7f326b421000, cli=0x7f326b4d23b0, timeout=0) at vcli_serve.c:621
#32 0x000055b5cc3b2a25 in mgt_cli_callback2 (e=0x7f326b4d9480, what=1) at mgt/mgt_cli.c:392
#33 0x000055b5cc4162c9 in VEV_Once (evb=0x7f326b429000) at vev.c:466
#34 0x000055b5cc415e0d in VEV_Loop (evb=0x7f326b429000) at vev.c:354
#35 0x000055b5cc3b7d3a in main (argc=34, argv=0x7ffca6e2a6f8) at mgt/mgt_main.c:995

nigoroll commented 8 months ago

I have added a workaround and the test case, but I think the root cause is actually an incorrect event ordering in Varnish-Cache.