Closed: ehocdet closed this issue 5 years ago
Can you check how many files you end up having in the workdir and its subdirs?
We may have to segregate VSM files into subdirectories if the directory search time becomes an issue
@bsdphk fewer than 50 files in _.vsm_child, the biggest one.
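For reference, the counts can be reproduced with something along these lines (assuming the default workdir under /var/lib/varnish; adjust the path for a custom -n/workdir):
find /var/lib/varnish -type f | wc -l                          # total files
find /var/lib/varnish -type d | while read -r d; do
    printf '%s %s\n' "$(find "$d" -maxdepth 1 -type f | wc -l)" "$d"   # files per directory
done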
When a VCL is discarded, it looks like there is a write for every segment of every cluster in vsm_child: at least 5000 (for the backends' VSC in the big VSC_cluster) * the number of loaded VCLs, and it takes a lot of time.
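(For an idea of how many segments are involved, the published index can be inspected directly; this is a rough check assuming the default workdir layout, so the exact path may differ per setup:)
wc -l /var/lib/varnish/*/_.vsm_child/_.index     # roughly one line per published VSM segment
ls /var/lib/varnish/*/_.vsm_child | wc -l        # number of VSM files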
I hope this commit helps?
I'm not keen on doing too radical things a week before a release...
I tested and deployed the patch in our production: it's better, thanks! vcl.discard still takes time because the index is updated many times for the StatDoc entries [see strace], but the impact is now much more manageable.
[...]
renameat(20, "_._.index.117a483319dd1694", 20, "_.index") = 0
openat(20, "_._.index.0b42867c4badd96a", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(20, "_._.index.0b42867c4badd96a", O_WRONLY|O_CREAT|O_EXCL, 0640) = 27
write(27, "# 9967 1536583822\n", 18) = 18
write(27, "_.StatDoc.2f96b5f504150d2f 0 291"..., 422909) = 422909
close(27) = 0
[...]
renameat(20, "_._.index.245a823c0ff3b9a6", 20, "_.index") = 0
openat(20, "_._.index.798a4c767836d6bd", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(20, "_._.index.798a4c767836d6bd", O_WRONLY|O_CREAT|O_EXCL, 0640) = 28
write(28, "# 9967 1536583822\n", 18) = 18
write(28, "_.StatDoc.2f96b5f504150d2f 0 291"..., 358741) = 358741
close(28) = 0
renameat(20, "_._.index.798a4c767836d6bd", 20, "_.index") = 0
openat(20, "_._.index.755c0c4d67aec6cc", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(20, "_._.index.755c0c4d67aec6cc", O_WRONLY|O_CREAT|O_EXCL, 0640) = 28
write(28, "# 9967 1536583822\n", 18) = 18
write(28, "_.StatDoc.2f96b5f504150d2f 0 291"..., 358660) = 358660
close(28) = 0
renameat(20, "_._.index.755c0c4d67aec6cc", 20, "_.index") = 0
openat(20, "_._.index.0e8200370622aba1", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(20, "_._.index.0e8200370622aba1", O_WRONLY|O_CREAT|O_EXCL, 0640) = 28
write(28, "# 9967 1536583822\n", 18) = 18
write(28, "_.StatDoc.2f96b5f504150d2f 0 291"..., 358579) = 358579
close(28) = 0
renameat(20, "_._.index.0e8200370622aba1", 20, "_.index") = 0
openat(20, "_._.index.5c2d40790ff51bf3", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(20, "_._.index.5c2d40790ff51bf3", O_WRONLY|O_CREAT|O_EXCL, 0640) = 28
write(28, "# 9967 1536583822\n", 18) = 18
write(28, "_.StatDoc.2f96b5f504150d2f 0 291"..., 358498) = 358498
close(28) = 0
renameat(20, "_._.index.5c2d40790ff51bf3", 20, "_.index") = 0
munmap(0x7f92c84a6000, 446464) = 0
unlinkat(20, "_.VSC_cluster.453671847868b5a7", 0) = 0
openat(20, "_._.index.19fcd2122516718f", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(20, "_._.index.19fcd2122516718f", O_WRONLY|O_CREAT|O_EXCL, 0640) = 28
write(28, "# 9967 1536583822\n", 18) = 18
write(28, "_.StatDoc.2f96b5f504150d2f 0 291"..., 358416) = 358416
close(28) = 0
renameat(20, "_._.index.19fcd2122516718f", 20, "_.index") = 0
munmap(0x7f92c1113000, 3066112) = 0
I will leave it here for now, too close to release.
Looks like there is some rare problem remaining after this fix (tested up to #5dc531f). Usage is the same as the original reporter's: a lot of backends and frequent load/discard.
However, on most servers everything is perfect. I couldn't reproduce the problem on a test server. It happens only on a very small number of production servers with minor differences, despite very similar load and mostly identical conditions. They are far from any overload, and /var/lib/varnish is mounted in RAM. Could it be connected to particular system library versions? I've tried different kernels, updated dependencies and gcc to the latest sid versions (Debian) and recompiled... Probably something is too new, or the reason is completely different.
Is this still a problem?
I use 6.0.2. No stuck VCLs. "Not responding to CLI, killed it" events still happen, but very rarely and only on low-end servers with significant load.
I have now started migrating to current master; I'll report back once it has been tested enough.
Master (#6ab2612): 4 days under load, everything is perfect. Thank you for your hard work :)
Same as in my last comment: vcl.discard still takes time, and it increases heavily with the number of VCLs (mostly ones in the busy state). vcl.load may block for a long time while a delayed discard is in progress. cli_timeout still needs a significant value. To work around that: vcl.load the new one, vcl.state cold the old one, and vcl.discard one VCL that is in state cold/cold.
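(In CLI terms the workaround looks roughly like this; the VCL names are placeholders:)
varnishadm vcl.load v_new /etc/varnish/user.vcl
varnishadm vcl.use v_new
varnishadm vcl.state v_old cold
# later, once vcl.list shows v_old as cold/cold:
varnishadm vcl.discard v_old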
Update: tested with 6.0.3 and master #a507232 on a large number of different servers.
In 99% of cases it works completely fine. However, on some servers it has really critical issues, and I will probably have to downgrade (to 6.0.2 or a bit older).
Maybe it is the same case as one of those described above (but with unclear logs), or there is some other reason:
04:47:16 varnishd CLI telnet 127.0.0.1 29508 127.0.0.1 7090 Rd vcl.discard v9d5e1bdc-6d41-4965-9673-6c5f0e7227aa
04:47:44 varnishd CLI telnet 127.0.0.1 29508 127.0.0.1 7090 Wr 200
...
04:48:01 varnishd CLI telnet 127.0.0.1 38140 127.0.0.1 7090 Rd vcl.load v95ff68b1-da10-49bf-a2cf-4856a7eb07e2 /etc/varnish/user.vcl
04:48:26 varnishd Child (24194) died signal=15
04:48:26 varnishd Child cleanup complete
04:48:26 varnishd manager stopping child
04:48:26 varnishd manager dies
04:48:26 systemd varnish.service: Main process exited, code=exited, status=64/USAGE
04:48:26 systemd varnish.service: Failed with result 'exit-code'.
04:48:26 systemd Reload failed for Varnish Cache, a high-performance HTTP accelerator.
@acachy I think 1) deserves a separate issue. The only idea I have at this point is to inspect a core dump in order to find out who is holding the VCL references, and whether rightly so.
Regarding 2/3 it would be interesting to learn if the patch from #2836 changes anything.
@nigoroll I've now tried #2836 (merged with #cc3cd17, not sure if that's the correct way to use it).
The number of crashes increased significantly. It looks like this:
20:54:27 varnishd CLI telnet 127.0.0.1 40734 127.0.0.1 7090 Rd vcl.load v83e3e6ff-04fa-46d5-9d03-907d24159cbd /etc/varnish/user.vcl
20:54:47 varnishd CLI telnet 127.0.0.1 40734 127.0.0.1 7090 Wr 200 VCL compiled.
20:54:47 varnishd CLI telnet 127.0.0.1 41598 127.0.0.1 7090 Rd vcl.use v83e3e6ff-04fa-46d5-9d03-907d24159cbd
20:54:47 varnishd CLI telnet 127.0.0.1 41598 127.0.0.1 7090 Wr 200 VCL 'v83e3e6ff-04fa-46d5-9d03-907d24159cbd' now active
20:55:33 varnishd CLI telnet 127.0.0.1 44226 127.0.0.1 7090 Rd vcl.list
20:55:34 varnishd CLI telnet 127.0.0.1 44244 127.0.0.1 7090 Rd vcl.discard vb8ce8b5d-181f-446c-a570-6e4f2facc1f7
20:55:34 varnishd Child (10666) not responding to CLI, killed it.
20:55:34 varnishd CLI telnet 127.0.0.1 44244 127.0.0.1 7090 Wr 200
20:55:34 varnishd Child (10666) died signal=6
20:55:34 varnishd Child (10666) Panic at: Thu, 11 Apr 2019 20:55:34 GMT
(edit: reformatted)
Assert error in vsmw_delseg(), common/common_vsmw.c line 214:
  Condition(s == VSB_len(vsmw->vsb)) not true.
version = varnish-trunk revision 669e8dbebaea8576e6e96384839a682d2b1a2b08, vrt api = 9.0
ident = Linux,4.17.0-3-amd64,x86_64,-junix,-sfile,-sfile,-smalloc,-smalloc,-sdefault,-hcritbit,epoll
now = 16816994.075920 (mono), 1555016134.130014 (real)
Backtrace:
  0x555c3d1d41fd: /usr/sbin/varnishd(+0x531fd) [0x555c3d1d41fd]
  0x555c3d246713: /usr/sbin/varnishd(VAS_Fail+0x13) [0x555c3d246713]
  0x555c3d1fccdc: /usr/sbin/varnishd(+0x7bcdc) [0x555c3d1fccdc]
  0x555c3d1fcf76: /usr/sbin/varnishd(VSMW_DestroyCluster+0x66) [0x555c3d1fcf76]
  0x7f8e646a7862: vcl_vb8ce8b5d-181f-446c-a570-6e4f2facc1f7.1554993757.124461/vgc.so(+0x31f862) [0x7f8e646a7862]
  0x7f8e646a78dd: vcl_vb8ce8b5d-181f-446c-a570-6e4f2facc1f7.1554993757.124461/vgc.so(+0x31f8dd) [0x7f8e646a78dd]
  0x555c3d1e4f2c: /usr/sbin/varnishd(+0x63f2c) [0x555c3d1e4f2c]
  0x555c3d1e7b3a: /usr/sbin/varnishd(VCL_Poll+0x12a) [0x555c3d1e7b3a]
  0x555c3d248632: /usr/sbin/varnishd(+0xc7632) [0x555c3d248632]
  0x555c3d248abf: /usr/sbin/varnishd(VCLS_Poll+0x35f) [0x555c3d248abf]
thread = (cache-main)
pthread.attr = {
  guard = 0,
  stack_bottom = 0x7fff55194000,
  stack_top = 0x7fff55992000,
  stack_size = 8380416,
}
thr.req = (nil) {
},
thr.busyobj = (nil) {
},
vmods = {
  std = {0x7fec312e90e0, Varnish trunk 669e8dbebaea8576e6e96384839a682d2b1a2b08, 0.0},
  directors = {0x7fec312e9150, Varnish trunk 669e8dbebaea8576e6e96384839a682d2b1a2b08, 0.0},
},
> 20:55:34 varnishd Child cleanup complete
> 20:55:34 varnishd Child (12070) Started
@acachy thank you for testing and the panic info, this is helpful information and I will get back to it
@acachy is this still with cli_timeout = 900s?
I did not change anything to lower the required timeout; I was just curious whether the crashes caused by "parallel CLI" go away.
- service varnish reload &
- varnishadm vcl.discard boot
  Rejected 400 CLI communication error (hdr)
  Rejected 400 CLI communication error (hdr)
- varnishadm vcl.discard boot
Job for varnish.service failed because the control process exited with error code.
...Child (16564) not responding to CLI, killed it.
Update about the vsmw_delseg() panic: it is not related to your patch; it happens on clean master too (#cc3cd17).
It happens only on that "strange" server with stuck VCLs, and it is probably just one more kind of out-of-space problem for /var/lib/varnish.
I caught the server several times with low space (20 MB of 650 MB free), falling into an infinite "discard -> panic -> child restart" loop:
00:21:31 varnishd Child cleanup complete 00:21:31 varnishd Child (16183) Started 00:21:53 varnishd Child (16183) said Child starts 00:21:53 varnishd Child (16183) said SMF.medium0 mmap'ed 201326592000 bytes of 201326592000 00:21:53 varnishd Child (16183) said SMF.medium1 mmap'ed 201326592000 bytes of 201326592000 00:24:24 varnishd CLI telnet 127.0.0.1 65226 127.0.0.1 7090 Rd vcl.load v01c22d63-6f1c-4d92-b188-8d60094ee84d /etc/varnish/user.vcl 00:24:44 varnishd CLI telnet 127.0.0.1 65226 127.0.0.1 7090 Wr 200 VCL compiled. 00:24:44 varnishd CLI telnet 127.0.0.1 7532 127.0.0.1 7090 Rd vcl.use v01c22d63-6f1c-4d92-b188-8d60094ee84d 00:24:44 varnishd CLI telnet 127.0.0.1 7532 127.0.0.1 7090 Wr 200 VCL 'v01c22d63-6f1c-4d92-b188-8d60094ee84d' now active 00:25:30 varnishd CLI telnet 127.0.0.1 9486 127.0.0.1 7090 Rd vcl.list 00:25:30 varnishd CLI telnet 127.0.0.1 9486 127.0.0.1 7090 Wr 200 available auto cold 0 v340b7e63-d70f-4985-932d-80d740aea930#012available auto cold 0 v46927630-8bd4-4024-b3b2-9d32216df178#012available auto cold 0 vf0285796-08ce-4221-942d-a5ce5693d955#012available auto cold 0 v0ce7e43c-2872-4065-805f-463b1c1194aa#012available auto cold 0 v6c4f4613-c2c9-4ef2-98b1-22bfaf3cce62#012available auto cold 0 v49c4ca17-00c2-4b95-a3b1-9d03e37078a3#012available auto cold 0 v116ed0f7-abda-4b14-bb8d-c7c533da7bc3#012available auto cold 0 vb0167cd5-b42d-412b-bb97-831b452428eb#012available auto cold 0 v928b79bf-3b0f-40ed-8abc-2a7fa4583169#012available auto cold 0 v57286244-d482-4db9-9fdb-857d293b99f2#012available auto cold 0 v96dca88a-10fa-42bd-97f4-b1dcf77551e3#012available auto cold 0 vdda5c30b-fc5c-4510-9de2-5c51ea763d23#012available auto cold 0 vabe59324-eaec-4100-ae5c-9219054f8ea3#012available auto cold 0 v9c1b85d5-11f0-4d05-9fee-9a9a92c8fede#012available auto cold 0 v7aa03358-304a-41f3-84b5-c9f45aaf5ba3#012available auto cold 0 va0cc69af-dbd4-4625-a8c5-2cd712176ff1#012available auto cold 0 vf02e19af-4aec-4c44-9f2a-0627dcf8e9c4#012available auto cold 0 ve4b4eeac-2cb5-4c64-801f-72290d038c3f#012available auto cold 0 vdc2e10bd-64c8-48e2-8f44-f1df407b8db1#012available auto cold 0 v760936a3-203c-4502-8279-c14f5cfc6043#012available auto cold 0 v3d92cc63-1e8d-48a1-9807-61b8a8d05a1d#012available auto cold 0 vd769414b-d542-4049-b43b-0a01a3386d9e#012available auto cold 0 vdc495745-20f7-42cb-be29-0387db5af1b8#012available auto cold 0 vf6e6bdc5-257b-4e06-967f-18646956388e#012available auto cold 0 vd1001f8e-3590-41ed-91fb-681e5ed51b28#012available auto cold 0 va912a70e-eb23-420c-b48c-45a46c5b843d#012available auto cold 0 v6f4358b5-2299-456b-a3e7-05e4da777bb1#012available auto cold 0 v8f2fb3e7-aeb7-441f-9013-4aa33c96c3e0#012available auto cold 0 v6f5036e1-1d7d-43ad-a9cb-1b02ea7684cf#012available auto cold 0 v6da104e1-57e0-437c-80f0-031feee9e81b#012available auto cold 0 v169b9b09-8200 00:25:30 varnishd CLI telnet 127.0.0.1 9492 127.0.0.1 7090 Rd vcl.discard v340b7e63-d70f-4985-932d-80d740aea930 00:25:31 varnishd Child (16183) not responding to CLI, killed it. 
00:25:31 varnishd CLI telnet 127.0.0.1 9492 127.0.0.1 7090 Wr 200 00:25:31 varnishd Child (16183) died signal=6 00:25:31 varnishd Child (16183) Panic at: Wed, 17 Apr 2019 00:25:31 GMT#012Assert error in vsmw_delseg(), common/common_vsmw.c line 214:#012 Condition(s == VSB_len(vsmw->vsb)) not true.#012version = varnish-trunk revision cc3cd17b51e289822fcdc46b76d03fd5d48e632d, vrt api = 9.0#012ident = Linux,4.19.0-4-amd64,x86_64,-junix,-sfile,-sfile,-smalloc,-smalloc,-sdefault,-hcritbit,epoll#012now = 411231.432136 (mono), 1555460731.007034 (real)#012Backtrace:#012 0x55b7233ca11d: /usr/sbin/varnishd(+0x5311d) [0x55b7233ca11d]#012 0x55b72343bee3: /usr/sbin/varnishd(VAS_Fail+0x13) [0x55b72343bee3]#012 0x55b7233f2b7c: /usr/sbin/varnishd(+0x7bb7c) [0x55b7233f2b7c]#012 0x55b7233f2e16: /usr/sbin/varnishd(VSMW_DestroyCluster+0x66) [0x55b7233f2e16]#012 0x7f579cea69a9: vcl_v340b7e63-d70f-4985-932d-80d740aea930.1555444806.194502/vgc.so(+0x3209a9) [0x7f579cea69a9]#012 0x7f579cea6a24: vcl_v340b7e63-d70f-4985-932d-80d740aea930.1555444806.194502/vgc.so(+0x320a24) [0x7f579cea6a24]#012 0x55b7233dae0c: /usr/sbin/varnishd(+0x63e0c) [0x55b7233dae0c]#012 0x55b7233dda1a: /usr/sbin/varnishd(VCL_Poll+0x12a) [0x55b7233dda1a]#012 0x55b72343de02: /usr/sbin/varnishd(+0xc6e02) [0x55b72343de02]#012 0x55b72343e28f: /usr/sbin/varnishd(VCLS_Poll+0x35f) [0x55b72343e28f]#012thread = (cache-main)#012pthread.attr = {#012 guard = 0,#012 stack_bottom = 0x7ffed88bd000,#012 stack_top = 0x7ffed90bc000,#012 stack_size = 8384512,#012}#012thr.req = (nil) {#012},#012thr.busyobj = (nil) {#012},#012vmods = {#012 std = {0x7fb569ee90e0, Varnish trunk cc3cd17b51e289822fcdc46b76d03fd5d48e632d, 0.0},#012 directors = {0x7fb569ee9150, Varnish trunk cc3cd17b51e289822fcdc46b76d03fd5d48e632d, 0.0},#012},#012 00:25:31 varnishd Child cleanup complete 00:25:31 varnishd Child (17388) Started 00:25:52 varnishd Child (17388) said Child starts 00:25:52 varnishd Child (17388) said SMF.medium0 mmap'ed 201326592000 bytes of 201326592000 00:25:52 varnishd Child (17388) said SMF.medium1 mmap'ed 201326592000 bytes of 201326592000 00:28:25 varnishd CLI telnet 127.0.0.1 16094 127.0.0.1 7090 Rd vcl.load vafb30477-4622-457b-90b2-f42bbd715191 /etc/varnish/user.vcl 00:28:44 varnishd CLI telnet 127.0.0.1 16094 127.0.0.1 7090 Wr 200 VCL compiled. 
00:28:44 varnishd CLI telnet 127.0.0.1 16836 127.0.0.1 7090 Rd vcl.use vafb30477-4622-457b-90b2-f42bbd715191 00:28:44 varnishd CLI telnet 127.0.0.1 16836 127.0.0.1 7090 Wr 200 VCL 'vafb30477-4622-457b-90b2-f42bbd715191' now active 00:29:31 varnishd CLI telnet 127.0.0.1 18836 127.0.0.1 7090 Rd vcl.list 00:29:31 varnishd CLI telnet 127.0.0.1 18836 127.0.0.1 7090 Wr 200 available auto cold 0 v46927630-8bd4-4024-b3b2-9d32216df178#012available auto cold 0 vf0285796-08ce-4221-942d-a5ce5693d955#012available auto cold 0 v0ce7e43c-2872-4065-805f-463b1c1194aa#012available auto cold 0 v6c4f4613-c2c9-4ef2-98b1-22bfaf3cce62#012available auto cold 0 v49c4ca17-00c2-4b95-a3b1-9d03e37078a3#012available auto cold 0 v116ed0f7-abda-4b14-bb8d-c7c533da7bc3#012available auto cold 0 vb0167cd5-b42d-412b-bb97-831b452428eb#012available auto cold 0 v928b79bf-3b0f-40ed-8abc-2a7fa4583169#012available auto cold 0 v57286244-d482-4db9-9fdb-857d293b99f2#012available auto cold 0 v96dca88a-10fa-42bd-97f4-b1dcf77551e3#012available auto cold 0 vdda5c30b-fc5c-4510-9de2-5c51ea763d23#012available auto cold 0 vabe59324-eaec-4100-ae5c-9219054f8ea3#012available auto cold 0 v9c1b85d5-11f0-4d05-9fee-9a9a92c8fede#012available auto cold 0 v7aa03358-304a-41f3-84b5-c9f45aaf5ba3#012available auto cold 0 va0cc69af-dbd4-4625-a8c5-2cd712176ff1#012available auto cold 0 vf02e19af-4aec-4c44-9f2a-0627dcf8e9c4#012available auto cold 0 ve4b4eeac-2cb5-4c64-801f-72290d038c3f#012available auto cold 0 vdc2e10bd-64c8-48e2-8f44-f1df407b8db1#012available auto cold 0 v760936a3-203c-4502-8279-c14f5cfc6043#012available auto cold 0 v3d92cc63-1e8d-48a1-9807-61b8a8d05a1d#012available auto cold 0 vd769414b-d542-4049-b43b-0a01a3386d9e#012available auto cold 0 vdc495745-20f7-42cb-be29-0387db5af1b8#012available auto cold 0 vf6e6bdc5-257b-4e06-967f-18646956388e#012available auto cold 0 vd1001f8e-3590-41ed-91fb-681e5ed51b28#012available auto cold 0 va912a70e-eb23-420c-b48c-45a46c5b843d#012available auto cold 0 v6f4358b5-2299-456b-a3e7-05e4da777bb1#012available auto cold 0 v8f2fb3e7-aeb7-441f-9013-4aa33c96c3e0#012available auto cold 0 v6f5036e1-1d7d-43ad-a9cb-1b02ea7684cf#012available auto cold 0 v6da104e1-57e0-437c-80f0-031feee9e81b#012available auto cold 0 v169b9b09-8200-44a7-bd4c-ccbecfa0bab3#012available auto cold 0 v4fdaf50b-fee 00:29:32 varnishd CLI telnet 127.0.0.1 18856 127.0.0.1 7090 Rd vcl.discard v46927630-8bd4-4024-b3b2-9d32216df178 00:29:32 varnishd Child (17388) not responding to CLI, killed it.
After I made /var/lib/varnish huge there are no such crashes, just ~1 GB eaten during the first day with 100+ old VCLs.
I've done a complete OS reinstall to be sure it is not some crazy old library, but everything is still the same. I hope to provide something more useful once I succeed in dealing with a core dump.
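(For anyone hitting the same thing: checking and growing the workdir, typically a tmpfs, is quick; the mount point and size below are only examples:)
df -h /var/lib/varnish
mount -o remount,size=2G /var/lib/varnish    # if the workdir is a dedicated tmpfs mount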
@acachy this makes perfect sense. I was already suspecting that we were running into an out-of-space condition, but I was wondering why we would not see errno set for the short write in VSMW_DestroyCluster() / vsmw_delseg() in the panic https://github.com/varnishcache/varnish-cache/issues/2764#issuecomment-482455093
But I simply overlooked the fact that the last (short) write succeeds and only a follow-up attempt triggers an ENOSPC, as demonstrated easily on a small filesystem:
# truss dd if=/dev/zero of=t bs=123456
...
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 123456) = 123456
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 123456) = 123456
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 123456) = 60928
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 62528) = -1 ENOSPC (No space left on device)
So is the following an appropriate summary from your point of view?
- Varnish should not panic for ENOSPC
- We should work on optimizing runtime with many backends
Is there anything missing in the list?
More ideas:
- for visibility, we could maybe have a vcl.explainref command or similar explaining the references which hold a vcl
- would a domain_usage_timeout change anything?
- Connection: close for client connections on the non-active VCL
- broader idea (sort of unrelated): make (idle_send_timeout, send_timeout, timeout_idle, timeout_linger) settable per session from VCL. This would allow to tune timeouts based on how trustworthy one particular client is (e.g. anonymous client vs. authenticated vs. CDN servers).
You mentioned dynamic backends in the initial note. Are you talking about https://github.com/nigoroll/libvmod-dynamic ?

> Varnish should not panic for ENOSPC

- For all cases, including vcl.discard and vcl.load.

> for visibility, we could maybe have a vcl.explainref command or similar explaining the references which hold a vcl

- Suddenly my "strange" server has stopped having stuck VCLs at all, so it looks like this rare bug won't get investigated now. However, your solution with a forced "Connection: close" might be a complete fix for these cases, so this additional debug feature may not be useful.

> We should work on optimizing runtime with many backends

- To be specific: vcl.discard and varnishstat performance. Other things are OK with a many-backends configuration.

> you mentioned dynamic backends in the initial note. Are you talking about https://github.com/nigoroll/libvmod-dynamic

- The initial note is not mine. I use only the default modules (directors, std) and ~5k normal backends.

Extra: correct parallel execution of all CLI commands, including
@acachy do you have TCP_KEEPALIVE configured on these servers?
As far as I know, the only way VCLs can be held for a long time is if worker threads are stuck waiting on clients which have gone away but never sent a packet to say so. TCP_KEEPALIVE should fix that.
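(For reference, keepalive can be tuned both in varnishd and kernel-wide; the parameter names below are the usual ones but may differ per version and platform, so treat this as a sketch:)
varnishadm param.show tcp_keepalive_time
varnishadm param.set tcp_keepalive_time 300
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes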
As for reference counts increasing after vcl.discard, that can happen with ESI requests in-flight when the vcl.discard is executed, but I know of no other cases where that should happen.
@bsdphk

> As for reference counts increasing after vcl.discard, that can happen with ESI requests in-flight when the vcl.discard is executed, but I know of no other cases where that should happen.

ESI is not in use.

> As far as I know, the only way VCLs can be held for a long time is if worker threads are stuck waiting on clients which have gone away but never sent a packet to say so. TCP_KEEPALIVE should fix that.

So it is a very unusual case, but there are probably still some bugs around.
There have been numerous updates to the VSM code to improve performance in the latest Varnish release. It is suspected that this will have improved the situation, so I will close this ticket.
Current Behavior
The issue can happen after a vcl.discard (the VCL is in cold/cold state)
Context
This issue follows #2518. Varnish 6 with 5k+ backends and one VSC_cluster for dynamic backends (initialized in vcl_init). With VSC_cluster, vcl.load and VSC usage are fast (thanks for that).
vcl.load/vcl.discard is done very regularly, and the issue appears after a vcl.discard when more than 3 VCLs are loaded. To limit the impact, cli_timeout is set to 360s. To isolate the issue, vcl.discard is only done once the VCL is in cold/cold state.
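For example (360 is just the value used here; the parameter can be set at startup or at runtime):
varnishd -p cli_timeout=360 ...       # at startup
varnishadm param.set cli_timeout 360  # at runtime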
Your Environment
- Version used: based on Varnish 6.0.0 from git + h2 fixes from master
- Operating System and version: Linux 3.14.79 and 4.9.0
Debug:
strace of the child when VCLs are discarded:
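(A trace like the one shown earlier in this report can be captured along these lines, attaching to the cache child, i.e. the varnishd process whose parent is also varnishd; the exact pgrep/strace options may need adjusting per system:)
strace -f -tt -e trace=openat,write,renameat,unlinkat \
    -p "$(pgrep -P "$(pgrep -o varnishd)" varnishd)"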