varnishcache / varnish-cache

Varnish Cache source code repository
https://www.varnish-cache.org

Panic on vcl.load: Wrong turn at cache/cache_backend_probe.c:710: running #4199

Open delthas opened 1 week ago

delthas commented 1 week ago

Expected Behavior

No panic on the following VTC.

Important reproduction notes

This is against Varnish master + libvmod-dynamic master. I can only reproduce this when intentionally "slowing down" vmod_event with a sleep.

In my case in production, I originally hit the panic when using libvmod-redis; without its import in my VCL, the panic would not reproduce. I suspected some kind of race condition that only happens when vmod_event takes some time (because that would be the only side effect of loading libvmod-redis).

So, to reproduce the panic with the VCL below, add the following at the start of vmod_event in libvmod-dynamic/src/vmod_dynamic.c:

#include <unistd.h>
usleep(100000);

(Or run the following from libvmod-dynamic: sed -z -i -E 's|(vmod_event\(VRT_CTX, struct vmod_priv \*priv, enum vcl_event_e e\)\n\{\n)|\1\#include \<unistd.h\>\nusleep\(100000\)\;\n|g' src/vmod_dynamic.c)

This simulates a slowdown in vmod_event by sleeping for 100 ms.
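
After that edit, the top of vmod_event in src/vmod_dynamic.c looks roughly like this (only the two inserted lines are new; the rest of the function body is elided and unchanged, and the block-scope #include is unusual but works in practice):

vmod_event(VRT_CTX, struct vmod_priv *priv, enum vcl_event_e e)
{
#include <unistd.h>
usleep(100000);
        /* ... the original event handling continues here unchanged ... */
}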

Current Behavior

varnishtest "panic"

varnish v1 -vcl {
        vcl 4.1;
        backend local none;
} -start

varnish v1 -vcl {
        vcl 4.1;
        import std;
        import dynamic;
        import directors;

        backend local none;

        probe p1 {}
        probe p2 {}

        sub vcl_init {
                new cluster_default = directors.fallback();
                new r1 = dynamic.resolver();
                new i1 = dynamic.director(port = {"80"}, ttl = 10s, probe = p1, resolver = r1.use(), ttl_from = dns);
                cluster_default.add_backend(i1.backend(host={"example.com"}, port={"8000"}) );
                new r2 = dynamic.resolver();
                new i2 = dynamic.director(port = {"80"}, ttl = 10s, probe = p2, resolver = r2.use(), ttl_from = dns);
                cluster_default.add_backend(i2.backend(host={"google.com"}, port={"8001"}) );
        }
}

The panic is quite random, so reproducing it might take several tries:

while true; do varnishtest ./panic.vtc; done

The panic is:

Wrong turn at cache/cache_backend_probe.c:712:
scheduled
Backtrace:
  ip=0x63557d52e435 sp=0x7fffaf88dde0 <VBT_format+0x35>
  ip=0x63557d459773 sp=0x7fffaf88de00 <pan_backtrace+0x33>
  ip=0x63557d459473 sp=0x7fffaf88de20 <pan_ic+0x3e3>
  ip=0x63557d52d5f5 sp=0x7fffaf88dfa0 <VAS_Fail+0x55>
  ip=0x63557d4180b1 sp=0x7fffaf88dff0 <VBP_Control+0x231>
  ip=0x63557d41514f sp=0x7fffaf88e020 <vbe_dir_event+0x15f>
  ip=0x63557d42ab8d sp=0x7fffaf88e050 <VDI_Event+0xbd>
  ip=0x63557d474b1c sp=0x7fffaf88e070 <vcl_BackendEvent+0x16c>
  ip=0x63557d473f6c sp=0x7fffaf88e0a0 <vcl_set_state+0x58c>
  ip=0x63557d476cdf sp=0x7fffaf88e0f0 <vcl_load+0x3bf>
  ip=0x63557d474f4d sp=0x7fffaf88e140 <vcl_cli_load+0xcd>
  ip=0x63557d5327d2 sp=0x7fffaf88e170 <cls_dispatch+0x392>
  ip=0x63557d5322bd sp=0x7fffaf88e1b0 <cls_exec+0x2fd>
  ip=0x63557d53101f sp=0x7fffaf88e200 <cls_feed+0x58f>
  ip=0x63557d5309ff sp=0x7fffaf88e270 <VCLS_Poll+0x3bf>
  ip=0x63557d423b02 sp=0x7fffaf8902d0 <CLI_Run+0xf2>
  ip=0x63557d4534f5 sp=0x7fffaf890300 <child_main+0x485>
  ip=0x63557d4c81da sp=0x7fffaf890370 <mgt_launch_child+0x63a>
  ip=0x63557d4c924e sp=0x7fffaf8903f0 <mch_cli_server_start+0x3e>
  ip=0x63557d5327d2 sp=0x7fffaf890420 <cls_dispatch+0x392>
  ip=0x63557d5322bd sp=0x7fffaf890460 <cls_exec+0x2fd>
  ip=0x63557d53101f sp=0x7fffaf8904b0 <cls_feed+0x58f>
  ip=0x63557d5309ff sp=0x7fffaf890520 <VCLS_Poll+0x3bf>
  ip=0x63557d4ca335 sp=0x7fffaf892580 <mgt_cli_callback2+0x25>
  ip=0x63557d534bbe sp=0x7fffaf8925a0 <VEV_Once+0x53e>
  ip=0x63557d53465d sp=0x7fffaf8925f0 <VEV_Loop+0xed>
  ip=0x63557d4d02f3 sp=0x7fffaf892620 <main+0x1ed3>
  ip=0x7b288cde1e08 sp=0x7fffaf892870
  ip=0x7b288cde1ecc sp=0x7fffaf892910 <__libc_start_main+0x8c>
  ip=0x63557d40df35 sp=0x7fffaf892970 <_start+0x25>

But sometimes the VCL state is running instead of scheduled.

Possible Solution

No response

Steps to Reproduce (for bugs)

No response

Context

This was particularly hard to narrow down to an MWE VTC :stuck_out_tongue:

Varnish Cache version

varnishd master; libvmod-dynamic master + usleep patch from ticket

Operating system

No response

Source of binary packages used (if any)

No response

delthas commented 1 week ago

From my understanding, this could be related to https://github.com/varnishcache/varnish-cache/pull/4115 ?

cc @nigoroll

nigoroll commented 6 days ago

This is an excellent issue report, thank you! You even made an effort to patch vmod_dynamic with a cut&paste sed command...

(edit) The actual reproducer is a delay added for VCL_EVENT_WARM.

nigoroll commented 6 days ago

So what I think is happening here:

vmod_dynamic creates domain directors (basically resolver threads) during vcl_init{} as well; this is the feature you are using, and it is vital to layering. These directors create backends whenever name resolution completes.

If the VCL temperature is still VCL_TEMP_INIT when these backends get created, all is well: they get added to the director list, and when the VCL becomes warm they are sent a warm event by this code:

https://github.com/varnishcache/varnish-cache/blob/e142af75decb675a554434a48e65e55bd0267f0c/bin/varnishd/cache/cache_vcl.c#L633-L637

But by adding the delay, we prolong vcl_send_event(), and chances become (very) high that backends get created when the VCL is already warm, but before the vcl_BackendEvent(). Then they receive two warm events and booom.
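
For illustration, here is a standalone C model of that ordering (this is not Varnish source; warm_event() stands in for the VBP_Control() "Wrong turn" check, new_backend() for backend creation, and main() for vcl_set_state() with the delayed vcl_send_event()). Built with cc -pthread, the assert fires on nearly every run, mirroring the panic:

#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

enum temp { TEMP_INIT, TEMP_WARM };

struct backend {
        int warm;                       /* warm events received */
        struct backend *next;
};

static enum temp vcl_temp = TEMP_INIT;
static struct backend *directors = NULL;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

/* Stands in for the VBP_Control() check: a probe must not be
 * scheduled twice. */
static void
warm_event(struct backend *be)
{
        be->warm++;
        assert(be->warm == 1);          /* "Wrong turn": second warm event */
}

/* Stands in for backend creation: when the VCL is already warm,
 * the new backend is warmed immediately. */
static void
new_backend(void)
{
        struct backend *be = calloc(1, sizeof *be);

        assert(be != NULL);
        pthread_mutex_lock(&mtx);
        be->next = directors;
        directors = be;
        if (vcl_temp == TEMP_WARM)
                warm_event(be);
        pthread_mutex_unlock(&mtx);
}

/* Stands in for a vmod_dynamic resolver thread: name resolution
 * completes while vmod_event() is still sleeping. */
static void *
resolver(void *arg)
{
        (void)arg;
        usleep(50000);                  /* lands inside the 100 ms delay */
        new_backend();
        return (NULL);
}

int
main(void)
{
        pthread_t thr;
        struct backend *be;

        pthread_create(&thr, NULL, resolver, NULL);

        /* vcl_set_state(), simplified */
        pthread_mutex_lock(&mtx);
        vcl_temp = TEMP_WARM;           /* the VCL is warm from here on */
        pthread_mutex_unlock(&mtx);
        usleep(100000);                 /* vcl_send_event(): the patched
                                         * vmod_event() sleeps 100 ms */
        pthread_mutex_lock(&mtx);
        for (be = directors; be != NULL; be = be->next)
                warm_event(be);         /* vcl_BackendEvent(): second warm
                                         * event for the backend created
                                         * during the sleep -> assert */
        pthread_mutex_unlock(&mtx);

        pthread_join(thr, NULL);
        printf("lost the race, no panic this time\n");
        return (0);
}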

This touches on #4142, but from the other end: here we would need the list of directors from "before the warm event", and only those would need to get the event sent...
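
In terms of the model above, that idea would look roughly like this, replacing the end of main() (a hypothetical sketch of the direction only, not the actual patch):

        /* Snapshot the director list before the warm event. */
        pthread_mutex_lock(&mtx);
        struct backend *snapshot = directors;   /* directors existing
                                                 * before the warm event */
        vcl_temp = TEMP_WARM;
        pthread_mutex_unlock(&mtx);

        usleep(100000);                         /* vcl_send_event() */

        pthread_mutex_lock(&mtx);
        for (struct backend *be = snapshot; be != NULL; be = be->next)
                warm_event(be);                 /* backends created later are
                                                 * prepended in front of the
                                                 * old head, so they are
                                                 * skipped here; they were
                                                 * already warmed on creation */
        pthread_mutex_unlock(&mtx);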

nigoroll commented 6 days ago

@delthas does #4205 work for you?

delthas commented 5 days ago

Thanks for the patch! #4205 fixes the issue on my end; the test case passes. :+1: