nigoroll / libvmod-dynamic

The Varnish dns/named director continued
BSD 2-Clause "Simplified" License

assertion `(dom->status == DYNAMIC_ST_STARTING)` failing sporadically #105

Closed (kaiburjack closed this 1 year ago)

kaiburjack commented 1 year ago

Description

We are currently evaluating Varnish as a cache behind an ingress in our Kubernetes setup. For this, we use the varnish:7.4.1-alpine Docker image, which also contains this libvmod-dynamic module. To begin with, we use a very simple VCL:

vcl 4.1;
import dynamic;
backend default none;
sub vcl_init {
  new d = dynamic.director(port = "80");
}
sub vcl_recv {
  set req.backend_hint = d.backend(req.http.host);
}

As a side note: this is also used with an Istio service mesh, so in the end Istio will determine the actual destination IP to route to, based on the HTTP Host/Authority header, after Varnish sends the request to the supposed req.http.host backend.

So far so good. Everything works well but occasionally we see this assertion failing: https://github.com/nigoroll/libvmod-dynamic/blob/master/src/vmod_dynamic.c#L788

This does not happen very often (only about one or two times a day), but it does happen, and I wonder why.

Initial analysis

Looking at the code, the function in which this assertion fails is the entry point of a pthread spawned at https://github.com/nigoroll/libvmod-dynamic/blob/master/src/vmod_dynamic.c#L1012 .

Before the thread is spawned, dom->status is set to the asserted value, DYNAMIC_ST_STARTING.

Both the writing and the reading of dom->status are covered by a lock/mutex, and there is definitely a happens-before relation between the code setting dom->status = DYNAMIC_ST_STARTING; and the pthread function reading dom->status. However, there is no guarantee that another event handled in dom_event does not set a different value for dom->status between the processing of the VCL_EVENT_WARM event and the effective start of the pthread function in the separately spawned thread. Could a VCL_EVENT_COLD event be fired right after the VCL_EVENT_WARM, setting dom->status to a different value and thereby making the assertion fail?
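
To illustrate the interleaving I have in mind, here is a minimal, standalone C sketch (simplified stand-ins, not the vmod's actual code, states or event handling): the parent sets the status under the mutex and spawns the thread, but a later event handler can still take the mutex and change the status before the new thread ever runs, so the thread's assertion can fail even though every single access is locked.

/* race.c: minimal standalone sketch of the suspected interleaving,
 * with placeholder states; build with: cc -pthread race.c */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

enum status { ST_STARTING, ST_ACTIVE, ST_DELETED };  /* placeholders */

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static enum status status;

static void *lookup_thread(void *arg)
{
  (void)arg;
  pthread_mutex_lock(&mtx);
  assert(status == ST_STARTING);  /* analogous to the failing assertion */
  pthread_mutex_unlock(&mtx);
  return NULL;
}

int main(void)
{
  pthread_t thr;

  /* "warm" path: set the status, then spawn the lookup thread */
  pthread_mutex_lock(&mtx);
  status = ST_STARTING;
  pthread_mutex_unlock(&mtx);
  pthread_create(&thr, NULL, lookup_thread, NULL);

  /* a later event handler may run before lookup_thread is scheduled;
   * the mutex serializes the accesses but does not order them */
  pthread_mutex_lock(&mtx);
  status = ST_DELETED;
  pthread_mutex_unlock(&mtx);

  pthread_join(thr, NULL);
  printf("the assertion survived this run\n");
  return 0;
}

Depending on scheduling, this either prints the final message or aborts in the assert, which would match the sporadic nature of what we see.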

Coming to think of the actual assertion dom->status == DYNAMIC_ST_STARTING: what is being asserted there, and why? The while loop directly afterwards already checks whether the condition for the thread to keep running still holds, while (dom->status <= DYNAMIC_ST_ACTIVE) {, and I think this also covers the status == DYNAMIC_ST_STARTING check, since DYNAMIC_ST_STARTING <= DYNAMIC_ST_ACTIVE would be true as well.
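
Just to make that ordering argument concrete (placeholder enum definitions laid out in the order implied above, not the vmod's actual ones):

/* ordering.c: the loop guard alone already admits a STARTING domain,
 * because STARTING is ordered before ACTIVE; placeholder definitions */
#include <stdio.h>

enum dynamic_status { DYNAMIC_ST_STARTING, DYNAMIC_ST_ACTIVE, DYNAMIC_ST_DONE };

int main(void)
{
  enum dynamic_status status = DYNAMIC_ST_STARTING;

  if (status <= DYNAMIC_ST_ACTIVE)
    printf("the while condition also holds for DYNAMIC_ST_STARTING\n");
  return 0;
}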

stdout/stderr output of Varnish

The following is what the Varnish container outputs to stdout/stderr when the assertion fails and the child process is killed and restarted:

Error: Child (254) died signal=6
Error: Child (254) Panic at: Sat, 07 Oct 2023 22:15:50 GMT
Assert error in dom_lookup_thread(), vmod_dynamic.c line 788:
  Condition(dom->status == DYNAMIC_ST_STARTING) not true.
version = varnish-7.4.1 revision d5a5aa9cc879320840ca467ddbb7df0f99c9ba0f, vrt api = 18.0
ident = Linux,5.15.120+,x86_64,-jnone,-smalloc,-sdefault,-hcritbit,epoll
now = 1659349.487516 (mono), 1696716950.705925 (real)
Backtrace:
  ip=0x5ccdac571cdf sp=0x7b10d72ae240
  ip=0x5ccdac5d089b sp=0x7b10d72ae390
  ip=0x7b10d75e061c sp=0x7b10d72ae3d0 <dom_lookup_thread+0x140c>
  ip=0x7b10dde66285 sp=0x7b10d72aeac0 <pthread_exit+0x277>
argv = {
  [0] = "varnishd",
  [1] = "-F",
  [2] = "-f",
  [3] = "/etc/varnish/default.vcl",
  [4] = "-a",
  [5] = "http=:8080,HTTP",
  [6] = "-a",
  [7] = "proxy=:8443,PROXY",
  [8] = "-p",
  [9] = "feature=+http2",
  [10] = "-s",
  [11] = "malloc,1073741824",
  [12] = "-n",
  [13] = "/tmp/varnish_workdir",
  [14] = "-t",
  [15] = "5s",
  [16] = "-p",
  [17] = "default_grace=10s",
  [18] = "-p",
  [19] = "default_keep=0s",
}
pthread.self = 0x7b10d72aeb38
pthread.attr = {
  guard = 8192,
  stack_bottom = 0x7b10d728e000,
  stack_top = 0x7b10d72aeb08,
  stack_size = 133896,
}
thr.req = NULL
thr.busyobj = NULL
thr.worker = NULL
vmods = {
  std = {0x7b10d858cbc0, Varnish 7.4.1 d5a5aa9cc879320840ca467ddbb7df0f99c9ba0f, 0.0},
  dynamic = {0x7b10d858cc30, Varnish 7.4.1 d5a5aa9cc879320840ca467ddbb7df0f99c9ba0f, 18.0},
  basicauth = {0x7b10d858cca0, Varnish 7.4.1 d5a5aa9cc879320840ca467ddbb7df0f99c9ba0f, 0.0},
},
pools = {
  pool = 0x7b10d8c68c20 {
    nidle = 93,
    nthr = 100,
    lqueue = 0
  },
  pool = 0x7b10d8c68d50 {
    nidle = 96,
    nthr = 100,
    lqueue = 0
  },
},

Debug: Child cleanup complete
Debug: Child (480) Started
Child launched OK
Info: Child (480) said Child starts
nigoroll commented 1 year ago

Thank you for your detailed report, I highly appreciate that you put some real effort into the analysis. In general, the domain state machine is an area where I have recently made changes (mostly b54c48849a0ffb940f87d641c0b13584cb21540e and follow-ups). The reason is that the old lifetime model (which I had inherited from the original author of this vmod) did not work any more for layered configurations and with director reference counting. The original intention behind the assertion was indeed that nothing should happen to the domain state between initializing it and the lookup thread getting ready. I will need to stare at this again, but at least I wanted to give you some feedback.

Regarding your hypothesis about VCL events: can you give me an idea of how often you load/discard VCLs? From the setup you described, it sounds like you don't?

kaiburjack commented 1 year ago

Thanks for your answer and explanation! No VCL reloads occurred during those times. Also, looking into it some more, the request rate fluctuated, and these errors happened when transitioning from a period of very low traffic (e.g. 0.1 req/sec) to about 150 req/sec. This was due to e2e/acceptance tests for the backend-for-frontend that runs behind Varnish and is exercised by browser-based e2e tests. Whenever those e2e tests started, the assertion failure sometimes (not always) occurred at the beginning of the test run. The tests only lasted a few minutes.

nigoroll commented 1 year ago

@kaiburjack your hypothesis on the root cause was correct. See commit message for details.

In your particular case, if you are using the same domains for all tests, you might be able to reduce the probability of hitting this issue by setting domain_usage_timeout to a very high value, so that domain director objects basically never time out. Another option would be to prime domain creation by sending just a single request for each domain you are using before hitting the cache with parallel requests.

Again, thank you for your very good, exemplary bug report.