nigoroll / libvmod-dynamic

The Varnish dns/named director continued
BSD 2-Clause "Simplified" License
95 stars 34 forks source link

Still happens: Dynamic Backends can only be added to warm VCLs #117

Open nigoroll opened 3 months ago

nigoroll commented 3 months ago

I'm getting the same assert again, but in a slightly different way (even with the fix).

The stack is still:

Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
  ip=0x5f26789d8ebe sp=0x70bc775fe870 <pan_ic+0x34e>
  ip=0x5f2678aa5985 sp=0x70bc775fe9e0 <VAS_Fail+0x55>
  ip=0x5f2678a0ee86 sp=0x70bc775fea30 <VRT_AddDirector+0x956>
  ip=0x5f267899527d sp=0x70bc775feb80 <VRT_new_backend_clustered+0xc0d>
  ip=0x5f2678995d22 sp=0x70bc775fec80 <VRT_new_backend+0x132>
  ip=0x70bc82abfe61 sp=0x70bc775fecb0 <ref_add+0x660>
  ip=0x70bc82ac0acc sp=0x70bc775ff1d0 <dom_update+0xb69>
  ip=0x70bc82ac1395 sp=0x70bc775ff340 <dom_lookup_thread+0x485>
  ip=0x70bc99399ded sp=0x70bc775ff450
  ip=0x70bc9941d0dc sp=0x70bc775ff500

I'm stressing Varnish + libvmod-dynamic with a lot of VCL reloads. Everything usually works fine for a while (sometimes hours), until Varnish starts to loop on this error, every ~15s to ~1 minute. So, no issues for hours and then once I get this panic, it repeats every ~15s to ~1min: child crashes, child restarts, child crashes, ...

Here's an excerpt of the full logs:

Debug: Child (606349) Started
Child launched OK
Info: Child (606349) said Child starts
Error: Child (606349) not responding to CLI, killed it.
Error: Child (606349) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error (hdr)
Error: Child (606349) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (606349) died signal=3 (core dumped)
Debug: Child cleanup complete
Debug: Child (645026) Started
Child launched OK
Info: Child (645026) said Child starts
Error: Child (645026) not responding to CLI, killed it.
Error: Child (645026) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (645026) died signal=6 (core dumped)
Error: Child (645026) Panic at: Fri, 24 May 2024 14:58:32 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
  ip=0x6281a32f6ebe sp=0x76e6579fe870 <pan_ic+0x34e>
  ip=0x6281a33c3985 sp=0x76e6579fe9e0 <VAS_Fail+0x55>
  ip=0x6281a332ce86 sp=0x76e6579fea30 <VRT_AddDirector+0x956>
  ip=0x6281a32b327d sp=0x76e6579feb80 <VRT_new_backend_clustered+0xc0d>
  ip=0x6281a32b3d22 sp=0x76e6579fec80 <VRT_new_backend+0x132>
  ip=0x76e661960e61 sp=0x76e6579fecb0 <ref_add+0x660>
  ip=0x76e661961acc sp=0x76e6579ff1d0 <dom_update+0xb69>
  ip=0x76e661962395 sp=0x76e6579ff340 <dom_lookup_thread+0x485>
  ip=0x76e676e3aded sp=0x76e6579ff450
  ip=0x76e676ebe0dc sp=0x76e6579ff500
Debug: Child cleanup complete
Debug: Child (649299) Started
Child launched OK
Info: Child (649299) said Child starts
Error: Child (649299) not responding to CLI, killed it.
Error: Child (649299) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (649299) died signal=6 (core dumped)
Error: Child (649299) Panic at: Fri, 24 May 2024 14:58:46 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
  ip=0x6281a32f6ebe sp=0x76e654bfe870 <pan_ic+0x34e>
  ip=0x6281a33c3985 sp=0x76e654bfe9e0 <VAS_Fail+0x55>
  ip=0x6281a332ce86 sp=0x76e654bfea30 <VRT_AddDirector+0x956>
  ip=0x6281a32b327d sp=0x76e654bfeb80 <VRT_new_backend_clustered+0xc0d>
  ip=0x6281a32b3d22 sp=0x76e654bfec80 <VRT_new_backend+0x132>
  ip=0x76e661686e61 sp=0x76e654bfecb0 <ref_add+0x660>
  ip=0x76e661687acc sp=0x76e654bff1d0 <dom_update+0xb69>
  ip=0x76e661688395 sp=0x76e654bff340 <dom_lookup_thread+0x485>
  ip=0x76e676e3aded sp=0x76e654bff450
  ip=0x76e676ebe0dc sp=0x76e654bff500
Debug: Child cleanup complete
Debug: Child (653709) Started
Child launched OK
Info: Child (653709) said Child starts
Error: Child (653709) not responding to CLI, killed it.
Error: Child (653709) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (653709) died signal=6 (core dumped)
Error: Child (653709) Panic at: Fri, 24 May 2024 14:59:01 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
  ip=0x6281a32f6ebe sp=0x76e6523fe870 <pan_ic+0x34e>
  ip=0x6281a33c3985 sp=0x76e6523fe9e0 <VAS_Fail+0x55>
  ip=0x6281a332ce86 sp=0x76e6523fea30 <VRT_AddDirector+0x956>
  ip=0x6281a32b327d sp=0x76e6523feb80 <VRT_new_backend_clustered+0xc0d>
  ip=0x6281a32b3d22 sp=0x76e6523fec80 <VRT_new_backend+0x132>
  ip=0x76e66099fe61 sp=0x76e6523fecb0 <ref_add+0x660>
  ip=0x76e6609a0acc sp=0x76e6523ff1d0 <dom_update+0xb69>
  ip=0x76e6609a1395 sp=0x76e6523ff340 <dom_lookup_thread+0x485>
  ip=0x76e676e3aded sp=0x76e6523ff450
  ip=0x76e676ebe0dc sp=0x76e6523ff500
Debug: Child cleanup complete
Debug: Child (658076) Started
Child launched OK
Info: Child (658076) said Child starts
Error: Child (658076) not responding to CLI, killed it.
Error: Child (658076) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (658076) died signal=6 (core dumped)
Error: Child (658076) Panic at: Fri, 24 May 2024 14:59:21 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
  ip=0x6281a32f6ebe sp=0x76e655ffe870 <pan_ic+0x34e>
  ip=0x6281a33c3985 sp=0x76e655ffe9e0 <VAS_Fail+0x55>
  ip=0x6281a332ce86 sp=0x76e655ffea30 <VRT_AddDirector+0x956>
  ip=0x6281a32b327d sp=0x76e655ffeb80 <VRT_new_backend_clustered+0xc0d>
  ip=0x6281a32b3d22 sp=0x76e655ffec80 <VRT_new_backend+0x132>
  ip=0x76e6617a5e61 sp=0x76e655ffecb0 <ref_add+0x660>
  ip=0x76e6617a6acc sp=0x76e655fff1d0 <dom_update+0xb69>
  ip=0x76e6617a7395 sp=0x76e655fff340 <dom_lookup_thread+0x485>
  ip=0x76e676e3aded sp=0x76e655fff450
  ip=0x76e676ebe0dc sp=0x76e655fff500
Debug: Child cleanup complete
Debug: Child (662441) Started
Child launched OK
Info: Child (662441) said Child starts
Error: Child (662441) not responding to CLI, killed it.
Error: Child (662441) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (662441) died signal=6 (core dumped)
Error: Child (662441) Panic at: Fri, 24 May 2024 14:59:38 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
  ip=0x6281a32f6ebe sp=0x76e6567fe870 <pan_ic+0x34e>
  ip=0x6281a33c3985 sp=0x76e6567fe9e0 <VAS_Fail+0x55>
  ip=0x6281a332ce86 sp=0x76e6567fea30 <VRT_AddDirector+0x956>
  ip=0x6281a32b327d sp=0x76e6567feb80 <VRT_new_backend_clustered+0xc0d>
  ip=0x6281a32b3d22 sp=0x76e6567fec80 <VRT_new_backend+0x132>
  ip=0x76e662eaae61 sp=0x76e6567fecb0 <ref_add+0x660>
  ip=0x76e662eabacc sp=0x76e6567ff1d0 <dom_update+0xb69>
  ip=0x76e662eac395 sp=0x76e6567ff340 <dom_lookup_thread+0x485>
  ip=0x76e676e3aded sp=0x76e6567ff450
  ip=0x76e676ebe0dc sp=0x76e6567ff500
Debug: Child cleanup complete

It continued crashing in a loop indefinitely for multiple days. I managed to attach to the process and the VCL was indeed COOL at the time of the panic.

Originally posted by @delthas in https://github.com/nigoroll/libvmod-dynamic/issues/108#issuecomment-2133195974

nigoroll commented 1 month ago

@delthas apologies for having lost track of this issue. If you have time, could you please (re)confirm that it still happens with latest Varnish-Cache? In particular, I would like to exclude the possibility of this being related to https://github.com/varnishcache/varnish-cache/issues/4108, for which the fix went in on Monday.

pnbecker commented 1 month ago

@nigoroll I can confirm this still happens with Varnish 7.5.0. Varnish 7.5.0 crashed here this morning with:

Wrong turn at cache/cache_vrt_vcl.c:252:
Dynamic Backends can only be added to warm VCLs

Panic.show in varnishadm reported:

varnish> panic.show
200        
Panic at: Fri, 19 Jul 2024 07:36:47 GMT
Wrong turn at cache/cache_vrt_vcl.c:252:
Dynamic Backends can only be added to warm VCLs
version = varnish-7.5.0 revision eef25264e5ca5f96a77129308edb83ccf84cb1b1, vrt api = 19.0
ident = Linux,5.15.111-flatcar,x86_64,-jnone,-smalloc,-sdefault,-hcritbit,epoll
now = 954234.585670 (mono), 1721374607.072795 (real)
Backtrace:
  0x55f72eefecfe: varnishd(+0x5ccfe) [0x55f72eefecfe]
  0x55f72ef7f2c5: varnishd(VAS_Fail+0x45) [0x55f72ef7f2c5]
  0x55f72ef2301b: varnishd(+0x8101b) [0x55f72ef2301b]
  0x55f72eed381d: varnishd(VRT_new_backend_clustered+0x45d) [0x55f72eed381d]
  0x7f976a0cf394: ./vmod_cache/_vmod_dynamic.6626bfa332c102136cbd5ce44844ca052182968667357aa177db213793d9b324(+0x7394) [0x7f976a0cf394]
  0x7f9777b2d134: /lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x7f9777b2d134]
  0x7f9777baca40: /lib/x86_64-linux-gnu/libc.so.6(__clone+0x40) [0x7f9777baca40]
argv = {
  [0] = \"varnishd\",
  [1] = \"-F\",
  [2] = \"-f\",
  [3] = \"/etc/varnish/default.vcl\",
  [4] = \"-a\",
  [5] = \"http=:80,HTTP\",
  [6] = \"-a\",
  [7] = \"proxy=:8443,PROXY\",
  [8] = \"-p\",
  [9] = \"feature=+http2\",
  [10] = \"-s\",
  [11] = \"malloc,27G\",
  [12] = \"-p\",
  [13] = \"feature=+http2\",
  [14] = \"-p\",
  [15] = \"default_keep=300\",
}
pthread.self = 0x7f9766e326c0
pthread.attr = {
  guard = 4096,
  stack_bottom = 0x7f9766633000,
  stack_top = 0x7f9766e33000,
  stack_size = 8388608,
}
thr.req = NULL
thr.busyobj = NULL
thr.worker = NULL
vmods = {
  std = {0x7f97774d7230, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 0.0},
  cookie = {0x7f97774d72a0, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 0.0},
  dynamic = {0x7f97774d7380, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 19.0},
  directors = {0x7f97774d73f0, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 0.0},
},
pools = {
  pool = 0x7f9770200000 {
    nidle = 96,
    nthr = 100,
    lqueue = 0
  },
  pool = 0x7f9770200640 {
    nidle = 96,
    nthr = 100,
    lqueue = 0
  },
},

The log of the varnish docker container contained more or less the same information:

Debug: Version: varnish-7.5.0 revision eef25264e5ca5f96a77129308edb83ccf84cb1b1,
Debug: Platform: Linux,5.15.111-flatcar,x86_64,-jnone,-smalloc,-sdefault,-hcritbit,
Debug: Child (22) Started,
Child launched OK,
Info: Child (22) said Child starts,
Error: Child (22) not responding to CLI, killed it.,
Error: Unexpected reply from ping: 400 CLI communication error (hdr),
Error: Child (22) died signal=9,
Debug: Child cleanup complete,
Debug: Child (4946) Started,
Child launched OK,
Error: Child (4946) not responding to CLI, killed it.,
Error: Child (4946) Pushing vcls failed:,
CLI communication error (hdr),
Debug: Stopping Child,
Info: Child (4946) said Child starts,
Error: Child (4946) died signal=6 (core dumped),
Error: Child (4946) Panic at: Fri, 19 Jul 2024 07:36:47 GMT,
Wrong turn at cache/cache_vrt_vcl.c:252:,
Dynamic Backends can only be added to warm VCLs,
version = varnish-7.5.0 revision eef25264e5ca5f96a77129308edb83ccf84cb1b1, vrt api = 19.0,
ident = Linux,5.15.111-flatcar,x86_64,-jnone,-smalloc,-sdefault,-hcritbit,epoll,
now = 954234.585670 (mono), 1721374607.072795 (real),
Backtrace:,
  0x55f72eefecfe: varnishd(+0x5ccfe) [0x55f72eefecfe],
  0x55f72ef7f2c5: varnishd(VAS_Fail+0x45) [0x55f72ef7f2c5],
  0x55f72ef2301b: varnishd(+0x8101b) [0x55f72ef2301b],
  0x55f72eed381d: varnishd(VRT_new_backend_clustered+0x45d) [0x55f72eed381d],
  0x7f976a0cf394: ./vmod_cache/_vmod_dynamic.6626bfa332c102136cbd5ce44844ca052182968667357aa177db213793d9b324(+0x7394) [0x7f976a0cf394],
  0x7f9777b2d134: /lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x7f9777b2d134],
  0x7f9777baca40: /lib/x86_64-linux-gnu/libc.so.6(__clone+0x40) [0x7f9777baca40],
argv = {,
  [0] = \"varnishd\",,
  [1] = \"-F\",,
  [2] = \"-f\",,
  [3] = \"/etc/varnish/default.vcl\",,
  [4] = \"-a\",,
  [5] = \"http=:80,HTTP\",,
  [6] = \"-a\",,
  [7] = \"proxy=:8443,PROXY\",,
  [8] = \"-p\",,
  [9] = \"feature=+http2\",,
  [10] = \"-s\",,
  [11] = \"malloc,27G\",,
  [12] = \"-p\",,
  [13] = \"feature=+http2\",,
  [14] = \"-p\",,
  [15] = \"default_keep=300\",,
},
pthread.self = 0x7f9766e326c0,
pthread.attr = {,
  guard = 4096,,
  stack_bottom = 0x7f9766633000,,
  stack_top = 0x7f9766e33000,,
  stack_size = 8388608,,
},
thr.req = NULL,
thr.busyobj = NULL,
thr.worker = NULL,,
{,
  std = {0x7f97774d7230, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 0.0},,
  cookie = {0x7f97774d72a0, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 0.0},,
  dynamic = {0x7f97774d7380, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 19.0},,
  directors = {0x7f97774d73f0, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 0.0},,
},,,
{,
  pool = 0x7f9770200000 {,
    nidle = 96,,
    nthr = 100,,
    lqueue = 0,
  },,
  pool = 0x7f9770200640 {,
    nidle = 96,,
    nthr = 100,,
    lqueue = 0,
  },,
},,
,
,
Debug: Child cleanup complete,
nigoroll commented 1 month ago

Could anyone who can reproduce this issue please run varnishd with this little patch and report the Panic?

diff --git a/bin/varnishd/cache/cache_vrt_vcl.c b/bin/varnishd/cache/cache_vrt_vcl.c
index 82e6e10a5..a17db5726 100644
--- a/bin/varnishd/cache/cache_vrt_vcl.c
+++ b/bin/varnishd/cache/cache_vrt_vcl.c
@@ -249,7 +249,7 @@ VRT_AddDirector(VRT_CTX, const struct vdi_methods *m, void *priv,
                return (vcldir_surplus(vdir));

        if (!temp->is_warm && temp != VCL_TEMP_INIT)
-               WRONG("Dynamic Backends can only be added to warm VCLs");
+               WRONG(temp->name);

        return (vdir->dir);
 }
nigoroll commented 1 month ago

I am trying hard to reproduce this with current code (https://github.com/varnishcache/varnish-cache/commit/7292f923dc731292226782a898a804f880567707 and vmod_dynamic e1b93074696ae811fe46e75e221d478bf024f530), no success.

Test vcl which I am using

vcl 4.1;

import dynamic;
import directors;

backend proforma none;

sub vcl_init {
    new r1 = dynamic.resolver();
    r1.set_resolution_type(RECURSING); # tried with and without

    new d = dynamic.director(
        resolver = r1.use(),
        ttl_from = dns
    );

    new rr = directors.round_robin();
    rr.add_backend(d.backend("www.uplex.de"));
}
sub vcl_recv {
    return (synth(200));
}
sub vcl_synth {
    set resp.body = "" + rr.backend().resolve() + " " +
        d.backend("uplex.de").resolve() + """
""";
    return (deliver);
}

start:

varnishd -a 127.0.0.1:8080 -f /tmp/t.vcl

vcl load loop:

while : ; do n=v$RANDOM ; varnishadm vcl.load $n /tmp/t.vcl && varnishadm vcl.use $n ; varnishadm vcl.list | awk '/^av/ {print "vcl.discard " $5}'|varnishadm ; varnishadm panic.show ; done

curl loop (in a second terminal):

while : ; do curl localhost:8080 ; done

I just don't get the panic, also not with this addition to deliberately delay resolution:

diff --git a/src/dyn_resolver_getdns.c b/src/dyn_resolver_getdns.c
index c674f1b..f97b761 100644
--- a/src/dyn_resolver_getdns.c
+++ b/src/dyn_resolver_getdns.c
@@ -32,6 +32,7 @@
 //#include <sys/types.h>
 #include <sys/socket.h>
 #include <netdb.h>
+#include <unistd.h>

 #ifdef __FreeBSD__
 #include <arpa/inet.h>
@@ -203,6 +204,7 @@ getdns_lookup(struct VPFX(dynamic_resolver) *r,
        ret = getdns_address_sync(c->context, node, NULL, &state->response);
        errchk(ret);

+       usleep(1000 * 1000);
        return (getdns_common_lookup_check(state));
 }

So, can anyone please help me with a reproducer?

nigoroll commented 1 month ago

@nigoroll I can confirm this still happens with Varnish 7.5.0. Varnish 7.5.0 crashed here this morning with:

FTR: The container build is https://github.com/varnish/docker-varnish/blob/master/fresh/debian/Dockerfile#L9C26-L9C66 which uses https://github.com/nigoroll/libvmod-dynamic/commit/5dc09f52cd8eeed77d879b0313bd8ad9a749477f, which is a minor change on top of https://github.com/nigoroll/libvmod-dynamic/commit/47bfcf54748ca47ad3bb35e20693a6d96eb5e77f

So if this was root-caused in dynamic, it should still happen.

nigoroll commented 1 month ago

Oops, I did not mean to close this issue before confirmation. (too much github magic, closing issues across projects is, hm, interesting)