nigoroll / libvmod-dynamic

The Varnish dns/named director continued
BSD 2-Clause "Simplified" License
95 stars 34 forks source link

Assert error in VRT_SetHealth(), cache/cache_vcl_vrt.c line 223 #70

Closed wimsymons closed 3 years ago

wimsymons commented 3 years ago

Hi,

We found a (concurrency?) bug, running on Varnish 6.1.1.

Not sure that it is in this module, but it happened when a deploy occurred on the infrastructure behind on of the dynamic directors (AWS Beanstalk). The load balancer got new instances, thus the IP addresses changed.

Varnish child dies, this is posted in the logs:

Mar 16 16:04:24 ip-172-22-84-104 varnishd[19503]: Child (19522) not responding to CLI, killed it.
Mar 16 16:04:24 ip-172-22-84-104 varnishd[19503]: Unexpected reply from ping: 400 CLI communication error (hdr)
Mar 16 16:04:24 ip-172-22-84-104 varnishd[19503]: Child (19522) died signal=6
Mar 16 16:04:24 ip-172-22-84-104 varnishd[19503]: Child (19522) Panic at: Tue, 16 Mar 2021 16:04:24 GMT
                                                  Assert error in VRT_SetHealth(), cache/cache_vcl_vrt.c line 223:
                                                    Condition((d) != NULL) not true.
                                                  version = varnish-6.1.1 revision efc2f6c1536cf2272e471f5cff5f145239b19460, vrt api = 8.0
                                                  ident = Linux,4.19.0-14-amd64,x86_64,-junix,-smalloc,-sdefault,-hcritbit,epoll
                                                  now = 714264.543532 (mono), 1615910663.806864 (real)
                                                  Backtrace:
                                                    0x5616f60a3d2a: /usr/sbin/varnishd(+0x4cd2a) [0x5616f60a3d2a]
                                                    0x5616f610bcc3: /usr/sbin/varnishd(VAS_Fail+0x13) [0x5616f610bcc3]
                                                    0x5616f60b6c0b: /usr/sbin/varnishd(VRT_SetHealth+0x6b) [0x5616f60b6c0b]
                                                    0x5616f6083520: /usr/sbin/varnishd(VBP_Remove+0x60) [0x5616f6083520]
                                                    0x5616f607f83e: /usr/sbin/varnishd(+0x2883e) [0x5616f607f83e]
                                                    0x5616f60b69d2: /usr/sbin/varnishd(VRT_DelDirector+0x112) [0x5616f60b69d2]
                                                    0x5616f6081509: /usr/sbin/varnishd(VBE_Poll+0xc9) [0x5616f6081509]
                                                    0x5616f6087d0d: /usr/sbin/varnishd(+0x30d0d) [0x5616f6087d0d]
                                                    0x5616f610dd02: /usr/sbin/varnishd(+0xb6d02) [0x5616f610dd02]
                                                    0x5616f610e369: /usr/sbin/varnishd(VCLS_Poll+0x2e9) [0x5616f610e369]
                                                  thread = (cache-main)
                                                  thr.req = (nil) {
                                                  },
                                                  thr.busyobj = (nil) {
                                                  },
Mar 16 16:04:24 ip-172-22-84-104 varnishd[19503]: Child cleanup complete
Mar 16 16:04:24 ip-172-22-84-104 varnishd[19503]: Child (30911) Started
Mar 16 16:04:24 ip-172-22-84-104 varnishd[19503]: Child (30911) said Child starts

Some snippets from the VCL config:

probe dynamic_probe {
  .interval = 5s;
  .timeout = 5s;
  .initial = 3;
  .threshold = 2;
  .window = 3;
  .request =
    "GET /health HTTP/1.1"
    "User-Agent: Varnish Probe"
    "Host: localhost"
    "Connection: close";
}

...

sub vcl_init {
  new dynamic_director = dynamic.director(
    port = "80",
    ttl = 2m,
    connect_timeout = 3s,
    first_byte_timeout = 10s,
    between_bytes_timeout = 5s,
    probe = dynamic_probe
  );
}

The libvmod-dynamic module was compiled on Thu, 15 Oct 2020 09:12:00 +0000 against the latest code now on the 6.1 branch (https://github.com/nigoroll/libvmod-dynamic/commit/4931cfd08ad6f142a9f13b064c8a5aa0723f84a1).

We need to restart the Varnish Prometheus exporter after that to send out correct metrics after that, but that's another issue.

nigoroll commented 3 years ago

I believe this would be a bug in https://github.com/varnishcache/varnish-cache, but anyway, you are running code which is now almost 2.5 years old and long EOL. In particular, VRT_SetHealth has been removed from varnish-cache since. So I will not work on this issue outside a support agreement.

wimsymons commented 3 years ago

Thanks for your input. I will take a look if we can upgrade to a later release. At the moment we're installing the latest version from Debian Buster.