nigoroll / libvmod-dynamic

The Varnish dns/named director continued
BSD 2-Clause "Simplified" License
96 stars 34 forks source link

Assert: Dynamic Backends can only be added to warm VCLs #108

Closed delthas closed 9 months ago

delthas commented 10 months ago

I got the following panic in production:

Wrong turn at cache/cache_vrt_vcl.c:224:
Dynamic Backends can only be added to warm VCLs
Backtrace:
  0x557c3b6a7872: /usr/sbin/varnishd(+0x5b872) [0x557c3b6a7872]
  0x557c3b724745: /usr/sbin/varnishd(VAS_Fail+0x45) [0x557c3b724745]
  0x557c3b6cb1e1: /usr/sbin/varnishd(VRT_AddDirector+0x3d1) [0x557c3b6cb1e1]
  0x557c3b67c023: /usr/sbin/varnishd(VRT_new_backend_clustered+0x4c3) [0x557c3b67c023]
  0x7fc4469918b8: ./vmod_cache/_vmod_dynamic.29b766e616262ab0ef9252419367ce48fbf642168ba34fc382b1742a77dac5c4(+0x68b8) [0x7fc4469918b8]
  0x7fc45f817064: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8064) [0x7fc45f817064]
  0x7fc45f54c62d: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fc45f54c62d]

Call stack is probably: dom_lookup_thread -> dom_update -> ref_add -> VRT_new_backend -> VRT_new_backend_clustered -> VRT_AddDirector

Could there possibly be a race condition between the lookup thread that could take a while to look up DNS records, then try to insert the new backends while the VCL was unloaded/set to cooling?

nigoroll commented 9 months ago

IRC log ftr:

(16:12:40) cc0: slink: about the libvmod-dynamic issue with cold/cooling while processing the lookup thread results. would you have an estimate of when you'd be able to work on a fix? in the meantime, since this affects my production i've started to make a fix. (16:12:40) cc0: following your suggestion i tried taking a ref to the VCL while the lookup thread was up, ie: VCL_Prevent_Cold when the lookup thread starts, then VCL_Allow_Cold when the lookup thread stops; but i get an assert saying that vcl->busy should be >0 but is 0. so i suppose i should only VCL_Prevent_Cold later, when i am warm, but the patch gets tricky. (16:12:40) cc0: (and also, IIUC, we wouldn't even know we have to go Done until all the domains time out, which means the VCL would stay busy for up to max(domain_usage_timeout), which sounds like too much?) (16:12:40) cc0: another way to address this (less clean?) would be to skip processing if we are done i suppose; so i added if (dom->status == DYNAMIC_ST_DONE) continue; in dom_update in the loop processing the dns results, and it seems to pass existing tests + mine. it looks suspiciously "too simple" though. what do you think? ... (18:22:36) slink: cc0 I hope next week (18:25:09) slink: cc0 the DYNAMIC_ST_DONE sounds like it could serve as a workaround. upfront I would think it was still racy, but I would need to look. (18:25:09) slink: you discovered one additional consideration with taking the vcl ref, but a more fundamental one is that creating directors can fail during VCL_COOLING (return null). at the moment the code assumes that missing directors will eventually appear. Again, this was only from my first look at it, I need to spend a couple of hours on this.

nigoroll commented 9 months ago

I think the root cause here is actually the problem I mentioned in https://github.com/nigoroll/libvmod-dynamic/issues/109#issuecomment-1912410690: With the wholesale VRT_VCL_Prevent_Discard() for the vmod, we rely on the vmod/director event dispatch order being correct. I will have another look if we can use more vcl references...

nigoroll commented 9 months ago

This affects 7.4 only and because I did not want to keep workarounds in the main branch, I have now created https://github.com/nigoroll/libvmod-dynamic/tree/7.4

delthas commented 5 months ago

I'm getting the same assert again, but in a slightly different way (even with the fix).

The stack is still:

Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
  ip=0x5f26789d8ebe sp=0x70bc775fe870 <pan_ic+0x34e>
  ip=0x5f2678aa5985 sp=0x70bc775fe9e0 <VAS_Fail+0x55>
  ip=0x5f2678a0ee86 sp=0x70bc775fea30 <VRT_AddDirector+0x956>
  ip=0x5f267899527d sp=0x70bc775feb80 <VRT_new_backend_clustered+0xc0d>
  ip=0x5f2678995d22 sp=0x70bc775fec80 <VRT_new_backend+0x132>
  ip=0x70bc82abfe61 sp=0x70bc775fecb0 <ref_add+0x660>
  ip=0x70bc82ac0acc sp=0x70bc775ff1d0 <dom_update+0xb69>
  ip=0x70bc82ac1395 sp=0x70bc775ff340 <dom_lookup_thread+0x485>
  ip=0x70bc99399ded sp=0x70bc775ff450
  ip=0x70bc9941d0dc sp=0x70bc775ff500

I'm stressing Varnish + libvmod-dynamic with a lot of VCL reloads. Everything usually works fine for a while (sometimes hours), until Varnish starts to loop on this error, every ~15s to ~1 minute. So, no issues for hours and then once I get this panic, it repeats every ~15s to ~1min: child crashes, child restarts, child crashes, ...

Here's an excerpt of the full logs:

Debug: Child (606349) Started
Child launched OK
Info: Child (606349) said Child starts
Error: Child (606349) not responding to CLI, killed it.
Error: Child (606349) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error (hdr)
Error: Child (606349) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (606349) died signal=3 (core dumped)
Debug: Child cleanup complete
Debug: Child (645026) Started
Child launched OK
Info: Child (645026) said Child starts
Error: Child (645026) not responding to CLI, killed it.
Error: Child (645026) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (645026) died signal=6 (core dumped)
Error: Child (645026) Panic at: Fri, 24 May 2024 14:58:32 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
  ip=0x6281a32f6ebe sp=0x76e6579fe870 <pan_ic+0x34e>
  ip=0x6281a33c3985 sp=0x76e6579fe9e0 <VAS_Fail+0x55>
  ip=0x6281a332ce86 sp=0x76e6579fea30 <VRT_AddDirector+0x956>
  ip=0x6281a32b327d sp=0x76e6579feb80 <VRT_new_backend_clustered+0xc0d>
  ip=0x6281a32b3d22 sp=0x76e6579fec80 <VRT_new_backend+0x132>
  ip=0x76e661960e61 sp=0x76e6579fecb0 <ref_add+0x660>
  ip=0x76e661961acc sp=0x76e6579ff1d0 <dom_update+0xb69>
  ip=0x76e661962395 sp=0x76e6579ff340 <dom_lookup_thread+0x485>
  ip=0x76e676e3aded sp=0x76e6579ff450
  ip=0x76e676ebe0dc sp=0x76e6579ff500
Debug: Child cleanup complete
Debug: Child (649299) Started
Child launched OK
Info: Child (649299) said Child starts
Error: Child (649299) not responding to CLI, killed it.
Error: Child (649299) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (649299) died signal=6 (core dumped)
Error: Child (649299) Panic at: Fri, 24 May 2024 14:58:46 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
  ip=0x6281a32f6ebe sp=0x76e654bfe870 <pan_ic+0x34e>
  ip=0x6281a33c3985 sp=0x76e654bfe9e0 <VAS_Fail+0x55>
  ip=0x6281a332ce86 sp=0x76e654bfea30 <VRT_AddDirector+0x956>
  ip=0x6281a32b327d sp=0x76e654bfeb80 <VRT_new_backend_clustered+0xc0d>
  ip=0x6281a32b3d22 sp=0x76e654bfec80 <VRT_new_backend+0x132>
  ip=0x76e661686e61 sp=0x76e654bfecb0 <ref_add+0x660>
  ip=0x76e661687acc sp=0x76e654bff1d0 <dom_update+0xb69>
  ip=0x76e661688395 sp=0x76e654bff340 <dom_lookup_thread+0x485>
  ip=0x76e676e3aded sp=0x76e654bff450
  ip=0x76e676ebe0dc sp=0x76e654bff500
Debug: Child cleanup complete
Debug: Child (653709) Started
Child launched OK
Info: Child (653709) said Child starts
Error: Child (653709) not responding to CLI, killed it.
Error: Child (653709) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (653709) died signal=6 (core dumped)
Error: Child (653709) Panic at: Fri, 24 May 2024 14:59:01 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
  ip=0x6281a32f6ebe sp=0x76e6523fe870 <pan_ic+0x34e>
  ip=0x6281a33c3985 sp=0x76e6523fe9e0 <VAS_Fail+0x55>
  ip=0x6281a332ce86 sp=0x76e6523fea30 <VRT_AddDirector+0x956>
  ip=0x6281a32b327d sp=0x76e6523feb80 <VRT_new_backend_clustered+0xc0d>
  ip=0x6281a32b3d22 sp=0x76e6523fec80 <VRT_new_backend+0x132>
  ip=0x76e66099fe61 sp=0x76e6523fecb0 <ref_add+0x660>
  ip=0x76e6609a0acc sp=0x76e6523ff1d0 <dom_update+0xb69>
  ip=0x76e6609a1395 sp=0x76e6523ff340 <dom_lookup_thread+0x485>
  ip=0x76e676e3aded sp=0x76e6523ff450
  ip=0x76e676ebe0dc sp=0x76e6523ff500
Debug: Child cleanup complete
Debug: Child (658076) Started
Child launched OK
Info: Child (658076) said Child starts
Error: Child (658076) not responding to CLI, killed it.
Error: Child (658076) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (658076) died signal=6 (core dumped)
Error: Child (658076) Panic at: Fri, 24 May 2024 14:59:21 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
  ip=0x6281a32f6ebe sp=0x76e655ffe870 <pan_ic+0x34e>
  ip=0x6281a33c3985 sp=0x76e655ffe9e0 <VAS_Fail+0x55>
  ip=0x6281a332ce86 sp=0x76e655ffea30 <VRT_AddDirector+0x956>
  ip=0x6281a32b327d sp=0x76e655ffeb80 <VRT_new_backend_clustered+0xc0d>
  ip=0x6281a32b3d22 sp=0x76e655ffec80 <VRT_new_backend+0x132>
  ip=0x76e6617a5e61 sp=0x76e655ffecb0 <ref_add+0x660>
  ip=0x76e6617a6acc sp=0x76e655fff1d0 <dom_update+0xb69>
  ip=0x76e6617a7395 sp=0x76e655fff340 <dom_lookup_thread+0x485>
  ip=0x76e676e3aded sp=0x76e655fff450
  ip=0x76e676ebe0dc sp=0x76e655fff500
Debug: Child cleanup complete
Debug: Child (662441) Started
Child launched OK
Info: Child (662441) said Child starts
Error: Child (662441) not responding to CLI, killed it.
Error: Child (662441) not responding to CLI, killed it.
Error: Unexpected reply from ping: 400 CLI communication error
Error: Child (662441) died signal=6 (core dumped)
Error: Child (662441) Panic at: Fri, 24 May 2024 14:59:38 GMT
Wrong turn at cache/cache_vrt_vcl.c:251:
Dynamic Backends can only be added to warm VCLs
Backtrace:
  ip=0x6281a32f6ebe sp=0x76e6567fe870 <pan_ic+0x34e>
  ip=0x6281a33c3985 sp=0x76e6567fe9e0 <VAS_Fail+0x55>
  ip=0x6281a332ce86 sp=0x76e6567fea30 <VRT_AddDirector+0x956>
  ip=0x6281a32b327d sp=0x76e6567feb80 <VRT_new_backend_clustered+0xc0d>
  ip=0x6281a32b3d22 sp=0x76e6567fec80 <VRT_new_backend+0x132>
  ip=0x76e662eaae61 sp=0x76e6567fecb0 <ref_add+0x660>
  ip=0x76e662eabacc sp=0x76e6567ff1d0 <dom_update+0xb69>
  ip=0x76e662eac395 sp=0x76e6567ff340 <dom_lookup_thread+0x485>
  ip=0x76e676e3aded sp=0x76e6567ff450
  ip=0x76e676ebe0dc sp=0x76e6567ff500
Debug: Child cleanup complete

It continued crashing in a loop indefinitely for multiple days. I managed to attach to the process and the VCL was indeed COOL at the time of the panic.