varnishcache / varnish-cache

Varnish Cache source code repository
https://www.varnish-cache.org
Other
3.64k stars 374 forks source link

Assert error in mgt_vcl_discard(): Condition((vp->warm) #4049

Open nigoroll opened 8 months ago

nigoroll commented 8 months ago

With #4048 applied, the test case from https://github.com/nigoroll/libvmod-dynamic/issues/110 can easily trigger this assertion failure when a lookup thread takes longer than cli_timeout to finish (I first saw this by accident while träwelling on an ICE train):

***  v1    debug|Error: Child (501422) not responding to CLI, killed it.
***  v1    debug|Assert error in mgt_vcl_discard(), mgt/mgt_vcl.c line 701:
***  v1    debug|  Condition((vp->warm) == 0) not true.

This happens when the mgt_vcl_askchild() times out, because the worker process is processing the cold event for longer.

This hack avoids the issue:

diff --git a/bin/varnishd/mgt/mgt_vcl.c b/bin/varnishd/mgt/mgt_vcl.c
index 9c4e01e9c..e20ebf0c0 100644
--- a/bin/varnishd/mgt/mgt_vcl.c
+++ b/bin/varnishd/mgt/mgt_vcl.c
@@ -696,6 +696,10 @@ mgt_vcl_discard(struct cli *cli, struct vclprog *vp)
                vp->warm = 0;
        } else {
                (void)mgt_vcl_setstate(cli, vp, VCL_STATE_COLD);
+               while (vp->warm) {
+                       usleep(1000*1000);
+                       (void)mgt_vcl_setstate(cli, vp, VCL_STATE_COLD);
+               }
        }
        if (MCH_Running()) {
                AZ(vp->warm);

How should we fix this issue? Some ideas:

nigoroll commented 7 months ago

bugwash: just dial up cli_timeout to some infinity-ish value. good point though: phk: slink, we may want to use a smaller timeout still for the MGT's ping/pong's...