Closed kaiburjack closed 1 year ago
Thank you for your detailed report, I highly appreciate that you put some real effort into the analysis. In general, the domain state machine is an area where I have recently made changes (mostly b54c48849a0ffb940f87d641c0b13584cb21540e and follow-ups). The reason is that the old lifetime model (which I had inherited from the original author of this vmod) did not work any more for layered configurations and with director reference counting. The original intention behind the assertion was indeed that nothing should happen to the domain state between initializing it and the lookup thread getting ready. I will need to stare at this again, but at least I wanted to give you some feedback.
Regarding your hypothesis about VCL events: Can you give me an idea about how often you load/discard VCLs? The setup sounded like you don't?
Thanks for your answer and explanation! VCL reloads haven't occurred during those times. Also, looking more into it, the number of requests fluctuated and these errors happened when transitioning between a period of very low requests per second (e.g. 0.1 req/sec) to about 150 req/sec. This was due to e2e/acceptance tests for the backend-for-frontend which ran behind Varnish and was tested by browser-based e2e tests. It seemed that whenever those e2e tests started, the assertion failure happened sometimes (not always) at the beginning of the test. The test duration was only a few minutes.
@kaiburjack your hypothesis on the root cause was correct. See commit message for details.
In your particular case, if you are using the same domains for all tests, you might be able to reduce the probability of this issue hitting by setting domain_usage_timeout
to a very high value such that domain director objects basically never time out.
Another option would be to prime domain creation by sending just a single request for each domain that you are using before hitting the cache with parallel requests.
Again, thank you for your very good,exemplary bug report.
Description
We are currently evaluating Varnish as a cache behind an ingress in our Kubernetes setup. For this, we use the varnish:7.4.1-alpine Docker image, which also contains this libvmod-dynamic module. We then use a very simple VCL to begin with:
As a side-note: This is also used with an Istio service mesh, so in the end Istio will determine the actual destination IP to route to after Varnish sends the request to the supposed
req.http.host
backend based on the HTTP Host/Authority header.So far so good. Everything works well but occasionally we see this assertion failing: https://github.com/nigoroll/libvmod-dynamic/blob/master/src/vmod_dynamic.c#L788
This does not happen very often (only about 1 or 2 times a day), but it does happen and I wonder, why.
Initial analysis
Looking at the code, the function where this assertion fails is the entrypoint function for a pthread spawned by https://github.com/nigoroll/libvmod-dynamic/blob/master/src/vmod_dynamic.c#L1012 .
Before the thread is spawned, the
dom->status
is set to the assertedDYNAMIC_ST_STARTING;
.And both the writing and reading of
dom->status
are covered by a lock/mutex. And there is definitely a happens-before relation between the code settingdom->status = DYNAMIC_ST_STARTING;
and the pthread function seeing the value ofdom->status
. However there is no guarantee that another event indom_event
might not set a different value fordom->status
after the processing of theVCL_EVENT_WARM
event and the effective start of the pthread function in a separately spawned thread. Could there be maybe aVCL_EVENT_COLD
event fired right after theVCL_EVENT_WARM
which would setdom->status
again to a different value, making the assertion then fail?Coming to think of the actual assertion
dom->status == DYNAMIC_ST_STARTING
: What is being asserted there and why? Because the while loop directly afterwards will also check if the condition for whether the thread should run is still truewhile (dom->status <= DYNAMIC_ST_ACTIVE) {
and I think this would also cover the check ofstatus == DYNAMIC_ST_STARTING
, sinceDYNAMIC_ST_STARTING <= DYNAMIC_ST_ACTIVE
would be true as well.stdout/stderr output of Varnish
The following are what Varnish's container will output to stdout/stderr when the assertion fails and the child process is killed and restarted: