Closed delthas closed 9 months ago
IRC log ftr:
(16:12:40) cc0: slink: about the libvmod-dynamic issue with cold/cooling while processing the lookup thread results. would you have an estimate of when you'd be able to work on a fix? in the meantime, since this affects my production i've started to make a fix. (16:12:40) cc0: following your suggestion i tried taking a ref to the VCL while the lookup thread was up, ie: VCL_Prevent_Cold when the lookup thread starts, then VCL_Allow_Cold when the lookup thread stops; but i get an assert saying that vcl->busy should be >0 but is 0. so i suppose i should only VCL_Prevent_Cold later, when i am warm, but the patch gets tricky. (16:12:40) cc0: (and also, IIUC, we wouldn't even know we have to go Done until all the domains time out, which means the VCL would stay busy for up to max(domain_usage_timeout), which sounds like too much?) (16:12:40) cc0: another way to address this (less clean?) would be to skip processing if we are done i suppose; so i added
if (dom->status == DYNAMIC_ST_DONE) continue;
in dom_update in the loop processing the dns results, and it seems to pass existing tests + mine. it looks suspiciously "too simple" though. what do you think? ... (18:22:36) slink: cc0 I hope next week (18:25:09) slink: cc0 the DYNAMIC_ST_DONE sounds like it could serve as a workaround. upfront I would think it was still racy, but I would need to look. (18:25:09) slink: you discovered one additional consideration with taking the vcl ref, but a more fundamental one is that creating directors can fail during VCL_COOLING (return null). at the moment the code assumes that missing directors will eventually appear. Again, this was only from my first look at it, I need to spend a couple of hours on this.
I think the root cause here is actually the problem I mentioned in https://github.com/nigoroll/libvmod-dynamic/issues/109#issuecomment-1912410690: With the wholesale VRT_VCL_Prevent_Discard()
for the vmod, we rely on the vmod/director event dispatch order being correct. I will have another look if we can use more vcl references...
This affects 7.4 only and because I did not want to keep workarounds in the main branch, I have now created https://github.com/nigoroll/libvmod-dynamic/tree/7.4
I'm getting the same assert again, but in a slightly different way (even with the fix).
The stack is still:
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
ip=0x5f26789d8ebe sp=0x70bc775fe870 <pan_ic+0x34e>
ip=0x5f2678aa5985 sp=0x70bc775fe9e0 <VAS_Fail+0x55>
ip=0x5f2678a0ee86 sp=0x70bc775fea30 <VRT_AddDirector+0x956>
ip=0x5f267899527d sp=0x70bc775feb80 <VRT_new_backend_clustered+0xc0d>
ip=0x5f2678995d22 sp=0x70bc775fec80 <VRT_new_backend+0x132>
ip=0x70bc82abfe61 sp=0x70bc775fecb0 <ref_add+0x660>
ip=0x70bc82ac0acc sp=0x70bc775ff1d0 <dom_update+0xb69>
ip=0x70bc82ac1395 sp=0x70bc775ff340 <dom_lookup_thread+0x485>
ip=0x70bc99399ded sp=0x70bc775ff450
ip=0x70bc9941d0dc sp=0x70bc775ff500
I'm stressing Varnish + libvmod-dynamic with a lot of VCL reloads. Everything usually works fine for a while (sometimes hours), until Varnish starts to loop on this error, every ~15s to ~1 minute. So, no issues for hours and then once I get this panic, it repeats every ~15s to ~1min: child crashes, child restarts, child crashes, ...
Here's an excerpt of the full logs:
Debug: Child (606349) Started
Child launched OK
Info: Child (606349) said Child starts
Error: Child (606349) not responding to CLI, killed it.
Error: Child (606349) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error (hdr)
Error: Child (606349) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (606349) died signal=3 (core dumped)
Debug: Child cleanup complete
Debug: Child (645026) Started
Child launched OK
Info: Child (645026) said Child starts
Error: Child (645026) not responding to CLI, killed it.
Error: Child (645026) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (645026) died signal=6 (core dumped)
Error: Child (645026) Panic at: Fri, 24 May 2024 14:58:32 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
ip=0x6281a32f6ebe sp=0x76e6579fe870 <pan_ic+0x34e>
ip=0x6281a33c3985 sp=0x76e6579fe9e0 <VAS_Fail+0x55>
ip=0x6281a332ce86 sp=0x76e6579fea30 <VRT_AddDirector+0x956>
ip=0x6281a32b327d sp=0x76e6579feb80 <VRT_new_backend_clustered+0xc0d>
ip=0x6281a32b3d22 sp=0x76e6579fec80 <VRT_new_backend+0x132>
ip=0x76e661960e61 sp=0x76e6579fecb0 <ref_add+0x660>
ip=0x76e661961acc sp=0x76e6579ff1d0 <dom_update+0xb69>
ip=0x76e661962395 sp=0x76e6579ff340 <dom_lookup_thread+0x485>
ip=0x76e676e3aded sp=0x76e6579ff450
ip=0x76e676ebe0dc sp=0x76e6579ff500
Debug: Child cleanup complete
Debug: Child (649299) Started
Child launched OK
Info: Child (649299) said Child starts
Error: Child (649299) not responding to CLI, killed it.
Error: Child (649299) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (649299) died signal=6 (core dumped)
Error: Child (649299) Panic at: Fri, 24 May 2024 14:58:46 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
ip=0x6281a32f6ebe sp=0x76e654bfe870 <pan_ic+0x34e>
ip=0x6281a33c3985 sp=0x76e654bfe9e0 <VAS_Fail+0x55>
ip=0x6281a332ce86 sp=0x76e654bfea30 <VRT_AddDirector+0x956>
ip=0x6281a32b327d sp=0x76e654bfeb80 <VRT_new_backend_clustered+0xc0d>
ip=0x6281a32b3d22 sp=0x76e654bfec80 <VRT_new_backend+0x132>
ip=0x76e661686e61 sp=0x76e654bfecb0 <ref_add+0x660>
ip=0x76e661687acc sp=0x76e654bff1d0 <dom_update+0xb69>
ip=0x76e661688395 sp=0x76e654bff340 <dom_lookup_thread+0x485>
ip=0x76e676e3aded sp=0x76e654bff450
ip=0x76e676ebe0dc sp=0x76e654bff500
Debug: Child cleanup complete
Debug: Child (653709) Started
Child launched OK
Info: Child (653709) said Child starts
Error: Child (653709) not responding to CLI, killed it.
Error: Child (653709) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (653709) died signal=6 (core dumped)
Error: Child (653709) Panic at: Fri, 24 May 2024 14:59:01 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
ip=0x6281a32f6ebe sp=0x76e6523fe870 <pan_ic+0x34e>
ip=0x6281a33c3985 sp=0x76e6523fe9e0 <VAS_Fail+0x55>
ip=0x6281a332ce86 sp=0x76e6523fea30 <VRT_AddDirector+0x956>
ip=0x6281a32b327d sp=0x76e6523feb80 <VRT_new_backend_clustered+0xc0d>
ip=0x6281a32b3d22 sp=0x76e6523fec80 <VRT_new_backend+0x132>
ip=0x76e66099fe61 sp=0x76e6523fecb0 <ref_add+0x660>
ip=0x76e6609a0acc sp=0x76e6523ff1d0 <dom_update+0xb69>
ip=0x76e6609a1395 sp=0x76e6523ff340 <dom_lookup_thread+0x485>
ip=0x76e676e3aded sp=0x76e6523ff450
ip=0x76e676ebe0dc sp=0x76e6523ff500
Debug: Child cleanup complete
Debug: Child (658076) Started
Child launched OK
Info: Child (658076) said Child starts
Error: Child (658076) not responding to CLI, killed it.
Error: Child (658076) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (658076) died signal=6 (core dumped)
Error: Child (658076) Panic at: Fri, 24 May 2024 14:59:21 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
ip=0x6281a32f6ebe sp=0x76e655ffe870 <pan_ic+0x34e>
ip=0x6281a33c3985 sp=0x76e655ffe9e0 <VAS_Fail+0x55>
ip=0x6281a332ce86 sp=0x76e655ffea30 <VRT_AddDirector+0x956>
ip=0x6281a32b327d sp=0x76e655ffeb80 <VRT_new_backend_clustered+0xc0d>
ip=0x6281a32b3d22 sp=0x76e655ffec80 <VRT_new_backend+0x132>
ip=0x76e6617a5e61 sp=0x76e655ffecb0 <ref_add+0x660>
ip=0x76e6617a6acc sp=0x76e655fff1d0 <dom_update+0xb69>
ip=0x76e6617a7395 sp=0x76e655fff340 <dom_lookup_thread+0x485>
ip=0x76e676e3aded sp=0x76e655fff450
ip=0x76e676ebe0dc sp=0x76e655fff500
Debug: Child cleanup complete
Debug: Child (662441) Started
Child launched OK
Info: Child (662441) said Child starts
Error: Child (662441) not responding to CLI, killed it.
Error: Child (662441) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (662441) died signal=6 (core dumped)
Error: Child (662441) Panic at: Fri, 24 May 2024 14:59:38 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
ip=0x6281a32f6ebe sp=0x76e6567fe870 <pan_ic+0x34e>
ip=0x6281a33c3985 sp=0x76e6567fe9e0 <VAS_Fail+0x55>
ip=0x6281a332ce86 sp=0x76e6567fea30 <VRT_AddDirector+0x956>
ip=0x6281a32b327d sp=0x76e6567feb80 <VRT_new_backend_clustered+0xc0d>
ip=0x6281a32b3d22 sp=0x76e6567fec80 <VRT_new_backend+0x132>
ip=0x76e662eaae61 sp=0x76e6567fecb0 <ref_add+0x660>
ip=0x76e662eabacc sp=0x76e6567ff1d0 <dom_update+0xb69>
ip=0x76e662eac395 sp=0x76e6567ff340 <dom_lookup_thread+0x485>
ip=0x76e676e3aded sp=0x76e6567ff450
ip=0x76e676ebe0dc sp=0x76e6567ff500
Debug: Child cleanup complete
It continued crashing in a loop indefinitely for multiple days. I managed to attach to the process and the VCL was indeed COOL at the time of the panic.
I got the following panic in production:
Call stack is probably: dom_lookup_thread -> dom_update -> ref_add -> VRT_new_backend -> VRT_new_backend_clustered -> VRT_AddDirector
Could there possibly be a race condition between the lookup thread that could take a while to look up DNS records, then try to insert the new backends while the VCL was unloaded/set to cooling?