nigoroll closed this 7 months ago
As is, this issue is hard. I could easily add a workaround, but that "workaround" basically also exists in Varnish-Cache, because a shortcut is taken in `VRT_AddDirector()` if the VCL is cooling, except in vtc mode:
So for all practical purposes, this issue exists primarily in VTC, but we cannot guarantee that it does not happen in production deployments.
Holding the vcl_mtx goes back to https://github.com/varnishcache/varnish-cache/issues/3094 / https://github.com/varnishcache/varnish-cache/commit/465f2f8c364cb7e5c6ae93bf46f2ba0f7c757d92 and I do now wonder if this is actually a good idea for sending VCL_EVENT_COLD
....
More pondering: When sending the cold event, we need to ensure that the list of directors operated upon is complete, so holding the `vcl_mtx` makes sense, otherwise we would need a more elaborate data structure and synchronization to allow additions/deletions while iterating.
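A minimal sketch of that trade-off, using plain pthreads and hypothetical names (`list_mtx`, `n_directors`, `try_add`, `add_during_iteration`), none of which are varnish-cache symbols: while the iterating thread holds the mutex, the list stays stable, but any thread that wants to add an entry is shut out for the whole iteration. The worker uses `pthread_mutex_trylock()` only so that the demonstration terminates; with a blocking lock it would simply wait.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-ins for illustration, not varnish-cache symbols. */
static pthread_mutex_t list_mtx = PTHREAD_MUTEX_INITIALIZER;
static int n_directors = 3;

/*
 * A worker that wants to add an entry to the list.
 * trylock reports EBUSY instead of blocking if the mutex is held.
 */
static void *
try_add(void *priv)
{
	int *res = priv;

	*res = pthread_mutex_trylock(&list_mtx);
	if (*res == 0) {
		n_directors++;
		(void)pthread_mutex_unlock(&list_mtx);
	}
	return (NULL);
}

/*
 * Hold the mutex as an iterator would, and let one worker attempt
 * an addition in the meantime. Returns the worker's trylock result.
 */
static int
add_during_iteration(void)
{
	pthread_t thr;
	int rc, res = -1;

	rc = pthread_mutex_lock(&list_mtx);
	assert(rc == 0);

	/* ... iterate over the list here, e.g. to deliver an event ... */
	rc = pthread_create(&thr, NULL, try_add, &res);
	assert(rc == 0);
	rc = pthread_join(thr, NULL);
	assert(rc == 0);

	rc = pthread_mutex_unlock(&list_mtx);
	assert(rc == 0);
	return (res);
}
```

Calling `add_during_iteration()` is expected to return `EBUSY` and leave `n_directors` unchanged. If the worker used a blocking lock and the iterator waited for the worker before unlocking, neither side could proceed.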
The temperature can only change in the CLI thread, so it cannot change again while a cold event is posted. Also, once the temperature goes cold, it cannot change back to something else for as long as the lookup thread is running. Maybe this can help us...
edit: the actual temperature change happens outside vcl_mtx
Hi, I agree to add the test case :smile:
Some notes: we cannot join the resolver thread after the COLD transition has completed, because by then (during discard) the resolver context may have become invalid (for the same reason, we cannot detach it). I am pretty much out of ideas, even including changes to varnish-cache. For example, turning `vcl_mtx` into an rwlock would at first seem like an option (iterators only need to read), but in `VRT_AddDirector()` we would need to check the temperature under a read lock and, if it is not cooling, upgrade to a write lock. That can deadlock, and the moment we unlock, `vcl_BackendEvent()` could run again and we never get the lock.
I am really tempted to ignore the problem, because the race window in production code (without vtc_mode) is only https://github.com/varnishcache/varnish-cache/blob/2511f5364489c8345b3879a5eee21179bf71e9b2/bin/varnishd/cache/cache_vrt_vcl.c#L178-L211
At this point I think this can only be fixed in varnish-cache, see https://github.com/varnishcache/varnish-cache/pull/4048
@delthas reported another issue based on a test case which I could have looked at in more detail earlier (I did not because I did not want to use `example.com`, but I really should have):
(I have slightly modified the test case - @delthas, do you agree to add it?)
The issue here is that `VDI_Event(..., VCL_EVENT_COLD)` waits for the lookup threads to finish while holding the `vcl_mtx`, which prevents the lookup threads from ... finishing: