CLI timeout after vcl.discard with many backend (in one VSC_cluster)

ehocdet commented 6 years ago

Current Behavior

The issue can happen after a vcl.discard (the VCL is in cold/cold state)

Sep  5 09:01:07 webacc1 varnishd[3608]: CLI telnet **** 31480 ***** X Rd vcl.discard cache_1_20180905T07h40m25s
Sep  5 09:05:08 webacc1 varnishd[3608]: Child (3326) not responding to CLI, killed it.
Sep  5 09:05:08 webacc1 varnishd[3608]: CLI telnet **** 31480 **** X Wr 200 
Sep  5 09:05:08 webacc1 varnishd[3608]: Child (3326) not responding to CLI, killed it.
Sep  5 09:05:08 webacc1 varnishd[3608]: Unexpected reply from ping: 400 CLI communication error (hdr)
Sep  5 09:05:08 webacc1 varnishd[3608]: Child (3326) died signal=3

Context

This issue follows #2518. Varnish 6 with 5k+ backends and one VSC_cluster for dynamic backends (init in vcl_ini). With VSC_cluster, vcl.load and VSC usages are fast (thanks for that).

vcl.load/vcl.discard is done very regularly, and the issue appears after a vcl.discard when more than 3 vcl are loaded. To limit the impact cli_timout is set to 360s. To isolate the issue, vcl.discard is done after vcl is in cold/cold state.

Your Environment

. Version used: based on Varnish 6.0.0 from git + h2 fixes from master . Operating System and version: Linux 3.14.79 and 4.9.0

Debug:

strace child when vcl are discarded:

[...]
write(2420, "_.VSC_cluster.3b361cef4062598a 3"..., 105) = 105
write(2420, "_.VSC_cluster.3b361cef4062598a 3"..., 106) = 106
write(2420, "_.VSC_cluster.3b361cef4062598a 3"..., 105) = 105
close(2420)                             = 0
renameat(2393, "_._.index.719d7a460983fb48", 2393, "_.index") = 0
futex(0x7f6bb650e428, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7f6bb650e428, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7f6bb650e428, FUTEX_WAKE_PRIVATE, 1) = 1
openat(2393, "_._.index.713890cf5a9b9b40", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(2393, "_._.index.713890cf5a9b9b40", O_WRONLY|O_CREAT|O_EXCL, 0640) = 2540
write(2540, "# 10076 1535964159\n", 19) = 19
write(2540, "_.StatDoc.41d7b86d030c0183 0 291"..., 50) = 50
write(2540, "_.Stat.4b3914d46be4ec7e 0 1040 S"..., 41) = 41
write(2540, "_.Log.5fb8450e7fed8a49 0 2684354"..., 43) = 43
write(2540, "_.StatDoc.18f86aea790184a3 0 760"..., 48) = 48
write(2540, "_.Stat.19797b2719caac0d 0 48 Sta"..., 46) = 46
write(2540, "_.Stat.328b5a184365a917 0 48 Sta"..., 42) = 42
write(2540, "_.Stat.66f65baf690d73c2 0 48 Sta"..., 46) = 46
write(2540, "_.Stat.15a15f723b42c6af 0 48 Sta"..., 42) = 42
write(2540, "_.Stat.0df538b15d6cb067 0 48 Sta"..., 42) = 42
write(2540, "_.Stat.1ee9f28f7370410a 0 48 Sta"..., 42) = 42
write(2540, "_.Stat.095f88c769eb82bb 0 48 Sta"..., 42) = 42
write(2540, "_.Stat.548b76c02c3a4095 0 48 Sta"..., 46) = 46
write(2540, "_.Stat.369b369c6c3e34a7 0 48 Sta"..., 45) = 45
write(2540, "_.Stat.4590a47d77a45e02 0 48 Sta"..., 47) = 47
write(2540, "_.Stat.583039882d31c98a 0 48 Sta"..., 43) = 43
write(2540, "_.Stat.15fca7b91a07f1f5 0 48 Sta"..., 47) = 47
write(2540, "_.Stat.303dcb0d6135bc8d 0 48 Sta"..., 42) = 42
write(2540, "_.Stat.05ecde740ff6101b 0 48 Sta"..., 46) = 46
write(2540, "_.Stat.612346d71ee5495e 0 48 Sta"..., 42) = 42
write(2540, "_.Stat.08f794bf7a9cc1fe 0 48 Sta"..., 43) = 43
write(2540, "_.Stat.38aff56c3b82eed7 0 48 Sta"..., 45) = 45
write(2540, "_.Stat.3e026b161fa6511b 0 48 Sta"..., 41) = 41
write(2540, "_.Stat.2490629953a3ca88 0 48 Sta"..., 44) = 44
write(2540, "_.StatDoc.5ae917ca32859b4a 0 174"..., 49) = 49
write(2540, "_.Stat.31107aef79d30a5a 0 112 St"..., 51) = 51
write(2540, "_.Stat.25f5dc543a7003b6 0 112 St"..., 48) = 48
write(2540, "_.Stat.63be8d157a815314 0 112 St"..., 49) = 49
write(2540, "_.Stat.66aa444b1a59c3b1 0 48 Sta"..., 42) = 42
write(2540, "_.StatDoc.66bf87bb2c3ae8c8 0 154"..., 49) = 49
write(2540, "_.Stat.11fe21b33eefc143 0 80 Sta"..., 41) = 41
write(2540, "_.Stat.596cb25327fac96d 0 80 Sta"..., 48) = 48
write(2540, "_.Stat.0c4e41f5707b5cda 0 112 St"..., 48) = 48
write(2540, "_.Stat.6d1bd0255f1047e1 0 112 St"..., 49) = 49
write(2540, "_.StatDoc.10d370af2f7769bd 0 207"..., 49) = 49
write(2540, "_.Stat.11225b1d38db44e1 0 48 Sta"..., 57) = 57
write(2540, "_.VSC_cluster.0e316aae03324009 2"..., 105) = 105
write(2540, "_.VSC_cluster.0e316aae03324009 2"..., 105) = 105
write(2540, "_.VSC_cluster.0e316aae03324009 2"..., 105) = 105
[...]
write(2540, "_.VSC_cluster.0e316aae03324009 3"..., 105) = 105
write(2540, "_.VSC_cluster.0e316aae03324009 3"..., 105) = 105
write(2540, "_.VSC_cluster.0e316aae03324009 3"..., 106) = 106
write(2540, "_.VSC_cluster.0e316aae03324009 3"..., 105) = 105
write(2540, "_.VSC_cluster.247442e9473f10a8 0"..., 39) = 39
write(2540, "_.VSC_cluster.247442e9473f10a8 0"..., 79) = 79
write(2540, "_.VSC_cluster.40cd3e397fce04c8 0"..., 42) = 42
write(2540, "_.VSC_cluster.40cd3e397fce04c8 0"..., 101) = 101
write(2540, "_.VSC_cluster.40cd3e397fce04c8 1"..., 103) = 103
write(2540, "_.VSC_cluster.40cd3e397fce04c8 2"..., 103) = 103
[...]

bsdphk commented 6 years ago

Can you check how many files you end up having in the workdir and the subdirs of it ?

We may have to segregate VSM files into subdirectories if the directory search time becomes an issue

ehocdet commented 6 years ago

@bsdphk less than 50 files in _.vsm_child the bigger one. When a VCL is discarded, it looks like a write on all segments on all clusters from vsm_child: a least 5000 (for backend's vsc in big vsc_cluster) * number of vcl, and it take a lot of time.

bsdphk commented 6 years ago

I hope this commit helps ?

I'm not keen on doing too radical things a week before a release...

ehocdet commented 6 years ago

I tested and deployed the patch in our production: It's better, thanks! vcl.discard still takes time because many indexes are updated for StatDoc [see strace], but the impact is now more easily manageable.

[...]
renameat(20, "_._.index.117a483319dd1694", 20, "_.index") = 0
openat(20, "_._.index.0b42867c4badd96a", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(20, "_._.index.0b42867c4badd96a", O_WRONLY|O_CREAT|O_EXCL, 0640) = 27
write(27, "# 9967 1536583822\n", 18)    = 18
write(27, "_.StatDoc.2f96b5f504150d2f 0 291"..., 422909) = 422909
close(27)                               = 0
[...]
renameat(20, "_._.index.245a823c0ff3b9a6", 20, "_.index") = 0
openat(20, "_._.index.798a4c767836d6bd", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(20, "_._.index.798a4c767836d6bd", O_WRONLY|O_CREAT|O_EXCL, 0640) = 28
write(28, "# 9967 1536583822\n", 18)    = 18
write(28, "_.StatDoc.2f96b5f504150d2f 0 291"..., 358741) = 358741
close(28)                               = 0
renameat(20, "_._.index.798a4c767836d6bd", 20, "_.index") = 0
openat(20, "_._.index.755c0c4d67aec6cc", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(20, "_._.index.755c0c4d67aec6cc", O_WRONLY|O_CREAT|O_EXCL, 0640) = 28
write(28, "# 9967 1536583822\n", 18)    = 18
write(28, "_.StatDoc.2f96b5f504150d2f 0 291"..., 358660) = 358660
close(28)                               = 0
renameat(20, "_._.index.755c0c4d67aec6cc", 20, "_.index") = 0
openat(20, "_._.index.0e8200370622aba1", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(20, "_._.index.0e8200370622aba1", O_WRONLY|O_CREAT|O_EXCL, 0640) = 28
write(28, "# 9967 1536583822\n", 18)    = 18
write(28, "_.StatDoc.2f96b5f504150d2f 0 291"..., 358579) = 358579
close(28)                               = 0
renameat(20, "_._.index.0e8200370622aba1", 20, "_.index") = 0
openat(20, "_._.index.5c2d40790ff51bf3", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(20, "_._.index.5c2d40790ff51bf3", O_WRONLY|O_CREAT|O_EXCL, 0640) = 28
write(28, "# 9967 1536583822\n", 18)    = 18
write(28, "_.StatDoc.2f96b5f504150d2f 0 291"..., 358498) = 358498
close(28)                               = 0
renameat(20, "_._.index.5c2d40790ff51bf3", 20, "_.index") = 0
munmap(0x7f92c84a6000, 446464)          = 0
unlinkat(20, "_.VSC_cluster.453671847868b5a7", 0) = 0
openat(20, "_._.index.19fcd2122516718f", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(20, "_._.index.19fcd2122516718f", O_WRONLY|O_CREAT|O_EXCL, 0640) = 28
write(28, "# 9967 1536583822\n", 18)    = 18
write(28, "_.StatDoc.2f96b5f504150d2f 0 291"..., 358416) = 358416
close(28)                               = 0
renameat(20, "_._.index.19fcd2122516718f", 20, "_.index") = 0
munmap(0x7f92c1113000, 3066112)         = 0

bsdphk commented 6 years ago

I will leave it here for now, too close to release.

acachy commented 6 years ago

Looks like there is some rare problem after this fix (tested up to #5dc531f). Usage is the same as TS has: a lot of backends and often load/discard.

Connections counters for old VCLs are decreased very slow (according to vcl.list). It takes hours and even days to release some of discarded VCLs.
Varnishstat became almost unusable when a lot of such VCLs are stucked.

However, on most servers everything is perfect. I couldn't reproduce problem on test-server. It happens only on very small amount of production servers with minor differences, having very similar load and most other conditions. Far from any overload, lib/varnish mounted to RAM. Could it be connected to any system libraries versions? I've tried different kernels, tried to update dependies and gcc to latest sid-versions (Debian) and recompile... Probably, something is too new or the reason is completely different.

bsdphk commented 5 years ago

Is this still a problem ?

acachy commented 5 years ago

I use 6.0.2 No stucked VCLs. Events "not responding to CLI, killed it" happen but very rare and only on low-end servers with significant load.

At the moment I started migration to current master, I'll report when it will be tested enough

acachy commented 5 years ago

Master (#6ab2612): 4 days under load, everything perfect Thank you for your hard work )

ehocdet commented 5 years ago

Same as my last comment: vcl.discard still takes time and increase heavily with number of vcl (mostly in busy state). vcl.load may block a long time when the delayed discard is effective. cli_timeout still needs a significant value. To workaround that: vcl.load, vcl.state cold the old one, and vcl.discard one vcl in state cold/cold.

acachy commented 5 years ago

Update. Tested with 6.0.3 and master #a507232 on big amount of different servers.

In 99% cases it works completely OK. However, on some servers it has really critical issues and probably I'd have to downgrade (to 6.0.2 or a bit older).

On few servers some VCLs are stuck in "available cold busy" state for many hours.

It makes related operations slow and "/var/lib/varnish" could grow even over 500MB (leading to crash on next reload, because I use small tmpfs for it).
The difference between these few "bad" servers and other 99% is huge. "Bad" servers could collect 20-30+ old vcls while "good" servers have <=1 (with similar load level and larger uptime).
Most stucked VCLs have just 1-5 connections in vcl.list.
Interesting detail. Comparing multiple runs of vcl.list I noticed that connections counter sometimes increase for very old VCLs. Looks like it might be some crazy requests that goes to wrong VCL and make them stucked. Is there any way to check details about that connections?
I've implemented call "vcl.state ... cold" for all busy VCLs, but looks like it makes no difference.

Crashes come very easy with parallel commands.

I have CLI timeout 900s.
When discard\reload and other CLI-operations intersects with each other, sometimes I get "CLI timeout" message and crash in ~60 seconds.
The most common case: discard command => wait 1 minute => reload command (some discard takes 5+ minutes, so still working at this moment) => wait 1 minute => timeout-crash
After I've strictly prohibited parallel execution of any varnishadm\varnishstat\reload commands - crashes became rare (1-2 times per day on the "worst" server)

Some crashes I still don't understand

Maybe it is same case as one of described (but with unclear logs) or here some another reasons:

04:47:16 varnishd  CLI telnet 127.0.0.1 29508 127.0.0.1 7090 Rd vcl.discard v9d5e1bdc-6d41-4965-9673-6c5f0e7227aa
04:47:44 varnishd  CLI telnet 127.0.0.1 29508 127.0.0.1 7090 Wr 200
...
04:48:01 varnishd  CLI telnet 127.0.0.1 38140 127.0.0.1 7090 Rd vcl.load v95ff68b1-da10-49bf-a2cf-4856a7eb07e2 /etc/varnish/user.vcl
04:48:26 varnishd  Child (24194) died signal=15
04:48:26 varnishd  Child cleanup complete
04:48:26 varnishd  manager stopping child
04:48:26 varnishd  manager dies
04:48:26 systemd  varnish.service: Main process exited, code=exited, status=64/USAGE
04:48:26 systemd  varnish.service: Failed with result 'exit-code'.
04:48:26 systemd  Reload failed for Varnish Cache, a high-performance HTTP accelerator.

Varnishstat is also funny on "bad" servers.
- Almost impossible to use manually - it stays many minutes without any output
- Simple command "varnishstat -f MAIN.uptime -1" takes ~5 minutes and uses 100% CPU-core all the time + up to several hundreds MB RAM.
- It even couldn't be killed normally - only SIGTERM helps

nigoroll commented 5 years ago

@acachy I think 1) deserves a separate issue. The only idea I got at this point was to inspect a core dump in order to find out who is holding the VCL references and if rightly so.

Regarding 2/3 it would be interesting to learn if the patch from #2836 changes anything.

acachy commented 5 years ago

@nigoroll I've tried now #2836 (in merge with #cc3cd17, not sure if it's correct way to use)

Amount of crashes increased significantly. It looks like this:

20:54:27 varnishd CLI telnet 127.0.0.1 40734 127.0.0.1 7090 Rd vcl.load v83e3e6ff-04fa-46d5-9d03-907d24159cbd /etc/varnish/user.vcl 20:54:47 varnishd CLI telnet 127.0.0.1 40734 127.0.0.1 7090 Wr 200 VCL compiled. 20:54:47 varnishd CLI telnet 127.0.0.1 41598 127.0.0.1 7090 Rd vcl.use v83e3e6ff-04fa-46d5-9d03-907d24159cbd 20:54:47 varnishd CLI telnet 127.0.0.1 41598 127.0.0.1 7090 Wr 200 VCL 'v83e3e6ff-04fa-46d5-9d03-907d24159cbd' now active 20:55:33 varnishd CLI telnet 127.0.0.1 44226 127.0.0.1 7090 Rd vcl.list 20:55:34 varnishd CLI telnet 127.0.0.1 44244 127.0.0.1 7090 Rd vcl.discard vb8ce8b5d-181f-446c-a570-6e4f2facc1f7 20:55:34 varnishd Child (10666) not responding to CLI, killed it. 20:55:34 varnishd CLI telnet 127.0.0.1 44244 127.0.0.1 7090 Wr 200 20:55:34 varnishd Child (10666) died signal=6 20:55:34 varnishd Child (10666) Panic at: Thu, 11 Apr 2019 20:55:34 GMT (edit:reformatted)
Assert error in vsmw_delseg(), common/common_vsmw.c line 214:
Condition(s == VSB_len(vsmw->vsb)) not true.
version = varnish-trunk revision 669e8dbebaea8576e6e96384839a682d2b1a2b08, vrt api = 9.0
ident = Linux,4.17.0-3-amd64,x86_64,-junix,-sfile,-sfile,-smalloc,-smalloc,-sdefault,-hcritbit,epoll
now = 16816994.075920 (mono), 1555016134.130014 (real)
Backtrace:
0x555c3d1d41fd: /usr/sbin/varnishd(+0x531fd) [0x555c3d1d41fd]
0x555c3d246713: /usr/sbin/varnishd(VAS_Fail+0x13) [0x555c3d246713]
0x555c3d1fccdc: /usr/sbin/varnishd(+0x7bcdc) [0x555c3d1fccdc]
0x555c3d1fcf76: /usr/sbin/varnishd(VSMW_DestroyCluster+0x66) [0x555c3d1fcf76]
0x7f8e646a7862: vcl_vb8ce8b5d-181f-446c-a570-6e4f2facc1f7.1554993757.124461/vgc.so(+0x31f862) [0x7f8e646a7862]
0x7f8e646a78dd: vcl_vb8ce8b5d-181f-446c-a570-6e4f2facc1f7.1554993757.124461/vgc.so(+0x31f8dd) [0x7f8e646a78dd]
0x555c3d1e4f2c: /usr/sbin/varnishd(+0x63f2c) [0x555c3d1e4f2c]
0x555c3d1e7b3a: /usr/sbin/varnishd(VCL_Poll+0x12a) [0x555c3d1e7b3a]
0x555c3d248632: /usr/sbin/varnishd(+0xc7632) [0x555c3d248632]
0x555c3d248abf: /usr/sbin/varnishd(VCLS_Poll+0x35f) [0x555c3d248abf]
thread = (cache-main)
pthread.attr = {
guard = 0,
stack_bottom = 0x7fff55194000,
stack_top = 0x7fff55992000,
stack_size = 8380416,
}
thr.req = (nil) {
},
thr.busyobj = (nil) {
},
vmods = {
std = {0x7fec312e90e0, Varnish trunk 669e8dbebaea8576e6e96384839a682d2b1a2b08, 0.0},
directors = {0x7fec312e9150, Varnish trunk 669e8dbebaea8576e6e96384839a682d2b1a2b08, 0.0},
},


> 20:55:34 varnishd  Child cleanup complete
> 20:55:34 varnishd  Child (12070) Started

nigoroll commented 5 years ago

@acachy thank you for testing and the panic info, this is helpful information and I will get back to it

nigoroll commented 5 years ago

@acachy is this still with cli_timeout = 900s ? I did not change anything to lower the timeout required, I was just curios if the "parallel CLI" caused crashes go away.

acachy commented 5 years ago

Yes, timeout always 900s, here is my non-default settings https://pastebin.com/Kkd8jqfr
Parallel CLI is not 100% deadly. Mostly it just shows "Rejected 400". I've made an additional test for case "discard during reload". In this way it looks similar with\without patch.

service varnish reload &

varnishadm vcl.discard boot Rejected 400 CLI communication error (hdr) Rejected 400 CLI communication error (hdr)

varnishadm vcl.discard boot

Job for varnish.service failed because the control process exited with error code.

...Child (16564) not responding to CLI, killed it.

acachy commented 5 years ago

Update. About vsmw_delseg() panic. It is not related to your patch, happens on clean master too (#cc3cd17).

It happens only on that "strange" server with stucked VCLs. And probably it's just one more type of "out of space" problem for /var/lib/varnish

I catch the server several times having low space (20mb of 650mb free) and falling in infinite loop "discard->panic->child restart":

00:21:31 varnishd Child cleanup complete 00:21:31 varnishd Child (16183) Started 00:21:53 varnishd Child (16183) said Child starts 00:21:53 varnishd Child (16183) said SMF.medium0 mmap'ed 201326592000 bytes of 201326592000 00:21:53 varnishd Child (16183) said SMF.medium1 mmap'ed 201326592000 bytes of 201326592000 00:24:24 varnishd CLI telnet 127.0.0.1 65226 127.0.0.1 7090 Rd vcl.load v01c22d63-6f1c-4d92-b188-8d60094ee84d /etc/varnish/user.vcl 00:24:44 varnishd CLI telnet 127.0.0.1 65226 127.0.0.1 7090 Wr 200 VCL compiled. 00:24:44 varnishd CLI telnet 127.0.0.1 7532 127.0.0.1 7090 Rd vcl.use v01c22d63-6f1c-4d92-b188-8d60094ee84d 00:24:44 varnishd CLI telnet 127.0.0.1 7532 127.0.0.1 7090 Wr 200 VCL 'v01c22d63-6f1c-4d92-b188-8d60094ee84d' now active 00:25:30 varnishd CLI telnet 127.0.0.1 9486 127.0.0.1 7090 Rd vcl.list 00:25:30 varnishd CLI telnet 127.0.0.1 9486 127.0.0.1 7090 Wr 200 available auto cold 0 v340b7e63-d70f-4985-932d-80d740aea930#012available auto cold 0 v46927630-8bd4-4024-b3b2-9d32216df178#012available auto cold 0 vf0285796-08ce-4221-942d-a5ce5693d955#012available auto cold 0 v0ce7e43c-2872-4065-805f-463b1c1194aa#012available auto cold 0 v6c4f4613-c2c9-4ef2-98b1-22bfaf3cce62#012available auto cold 0 v49c4ca17-00c2-4b95-a3b1-9d03e37078a3#012available auto cold 0 v116ed0f7-abda-4b14-bb8d-c7c533da7bc3#012available auto cold 0 vb0167cd5-b42d-412b-bb97-831b452428eb#012available auto cold 0 v928b79bf-3b0f-40ed-8abc-2a7fa4583169#012available auto cold 0 v57286244-d482-4db9-9fdb-857d293b99f2#012available auto cold 0 v96dca88a-10fa-42bd-97f4-b1dcf77551e3#012available auto cold 0 vdda5c30b-fc5c-4510-9de2-5c51ea763d23#012available auto cold 0 vabe59324-eaec-4100-ae5c-9219054f8ea3#012available auto cold 0 v9c1b85d5-11f0-4d05-9fee-9a9a92c8fede#012available auto cold 0 v7aa03358-304a-41f3-84b5-c9f45aaf5ba3#012available auto cold 0 va0cc69af-dbd4-4625-a8c5-2cd712176ff1#012available auto cold 0 vf02e19af-4aec-4c44-9f2a-0627dcf8e9c4#012available auto cold 0 ve4b4eeac-2cb5-4c64-801f-72290d038c3f#012available auto cold 0 vdc2e10bd-64c8-48e2-8f44-f1df407b8db1#012available auto cold 0 v760936a3-203c-4502-8279-c14f5cfc6043#012available auto cold 0 v3d92cc63-1e8d-48a1-9807-61b8a8d05a1d#012available auto cold 0 vd769414b-d542-4049-b43b-0a01a3386d9e#012available auto cold 0 vdc495745-20f7-42cb-be29-0387db5af1b8#012available auto cold 0 vf6e6bdc5-257b-4e06-967f-18646956388e#012available auto cold 0 vd1001f8e-3590-41ed-91fb-681e5ed51b28#012available auto cold 0 va912a70e-eb23-420c-b48c-45a46c5b843d#012available auto cold 0 v6f4358b5-2299-456b-a3e7-05e4da777bb1#012available auto cold 0 v8f2fb3e7-aeb7-441f-9013-4aa33c96c3e0#012available auto cold 0 v6f5036e1-1d7d-43ad-a9cb-1b02ea7684cf#012available auto cold 0 v6da104e1-57e0-437c-80f0-031feee9e81b#012available auto cold 0 v169b9b09-8200 00:25:30 varnishd CLI telnet 127.0.0.1 9492 127.0.0.1 7090 Rd vcl.discard v340b7e63-d70f-4985-932d-80d740aea930 00:25:31 varnishd Child (16183) not responding to CLI, killed it. 00:25:31 varnishd CLI telnet 127.0.0.1 9492 127.0.0.1 7090 Wr 200 00:25:31 varnishd Child (16183) died signal=6 00:25:31 varnishd Child (16183) Panic at: Wed, 17 Apr 2019 00:25:31 GMT#012Assert error in vsmw_delseg(), common/common_vsmw.c line 214:#012 Condition(s == VSB_len(vsmw->vsb)) not true.#012version = varnish-trunk revision cc3cd17b51e289822fcdc46b76d03fd5d48e632d, vrt api = 9.0#012ident = Linux,4.19.0-4-amd64,x86_64,-junix,-sfile,-sfile,-smalloc,-smalloc,-sdefault,-hcritbit,epoll#012now = 411231.432136 (mono), 1555460731.007034 (real)#012Backtrace:#012 0x55b7233ca11d: /usr/sbin/varnishd(+0x5311d) [0x55b7233ca11d]#012 0x55b72343bee3: /usr/sbin/varnishd(VAS_Fail+0x13) [0x55b72343bee3]#012 0x55b7233f2b7c: /usr/sbin/varnishd(+0x7bb7c) [0x55b7233f2b7c]#012 0x55b7233f2e16: /usr/sbin/varnishd(VSMW_DestroyCluster+0x66) [0x55b7233f2e16]#012 0x7f579cea69a9: vcl_v340b7e63-d70f-4985-932d-80d740aea930.1555444806.194502/vgc.so(+0x3209a9) [0x7f579cea69a9]#012 0x7f579cea6a24: vcl_v340b7e63-d70f-4985-932d-80d740aea930.1555444806.194502/vgc.so(+0x320a24) [0x7f579cea6a24]#012 0x55b7233dae0c: /usr/sbin/varnishd(+0x63e0c) [0x55b7233dae0c]#012 0x55b7233dda1a: /usr/sbin/varnishd(VCL_Poll+0x12a) [0x55b7233dda1a]#012 0x55b72343de02: /usr/sbin/varnishd(+0xc6e02) [0x55b72343de02]#012 0x55b72343e28f: /usr/sbin/varnishd(VCLS_Poll+0x35f) [0x55b72343e28f]#012thread = (cache-main)#012pthread.attr = {#012 guard = 0,#012 stack_bottom = 0x7ffed88bd000,#012 stack_top = 0x7ffed90bc000,#012 stack_size = 8384512,#012}#012thr.req = (nil) {#012},#012thr.busyobj = (nil) {#012},#012vmods = {#012 std = {0x7fb569ee90e0, Varnish trunk cc3cd17b51e289822fcdc46b76d03fd5d48e632d, 0.0},#012 directors = {0x7fb569ee9150, Varnish trunk cc3cd17b51e289822fcdc46b76d03fd5d48e632d, 0.0},#012},#012 00:25:31 varnishd Child cleanup complete 00:25:31 varnishd Child (17388) Started 00:25:52 varnishd Child (17388) said Child starts 00:25:52 varnishd Child (17388) said SMF.medium0 mmap'ed 201326592000 bytes of 201326592000 00:25:52 varnishd Child (17388) said SMF.medium1 mmap'ed 201326592000 bytes of 201326592000 00:28:25 varnishd CLI telnet 127.0.0.1 16094 127.0.0.1 7090 Rd vcl.load vafb30477-4622-457b-90b2-f42bbd715191 /etc/varnish/user.vcl 00:28:44 varnishd CLI telnet 127.0.0.1 16094 127.0.0.1 7090 Wr 200 VCL compiled. 00:28:44 varnishd CLI telnet 127.0.0.1 16836 127.0.0.1 7090 Rd vcl.use vafb30477-4622-457b-90b2-f42bbd715191 00:28:44 varnishd CLI telnet 127.0.0.1 16836 127.0.0.1 7090 Wr 200 VCL 'vafb30477-4622-457b-90b2-f42bbd715191' now active 00:29:31 varnishd CLI telnet 127.0.0.1 18836 127.0.0.1 7090 Rd vcl.list 00:29:31 varnishd CLI telnet 127.0.0.1 18836 127.0.0.1 7090 Wr 200 available auto cold 0 v46927630-8bd4-4024-b3b2-9d32216df178#012available auto cold 0 vf0285796-08ce-4221-942d-a5ce5693d955#012available auto cold 0 v0ce7e43c-2872-4065-805f-463b1c1194aa#012available auto cold 0 v6c4f4613-c2c9-4ef2-98b1-22bfaf3cce62#012available auto cold 0 v49c4ca17-00c2-4b95-a3b1-9d03e37078a3#012available auto cold 0 v116ed0f7-abda-4b14-bb8d-c7c533da7bc3#012available auto cold 0 vb0167cd5-b42d-412b-bb97-831b452428eb#012available auto cold 0 v928b79bf-3b0f-40ed-8abc-2a7fa4583169#012available auto cold 0 v57286244-d482-4db9-9fdb-857d293b99f2#012available auto cold 0 v96dca88a-10fa-42bd-97f4-b1dcf77551e3#012available auto cold 0 vdda5c30b-fc5c-4510-9de2-5c51ea763d23#012available auto cold 0 vabe59324-eaec-4100-ae5c-9219054f8ea3#012available auto cold 0 v9c1b85d5-11f0-4d05-9fee-9a9a92c8fede#012available auto cold 0 v7aa03358-304a-41f3-84b5-c9f45aaf5ba3#012available auto cold 0 va0cc69af-dbd4-4625-a8c5-2cd712176ff1#012available auto cold 0 vf02e19af-4aec-4c44-9f2a-0627dcf8e9c4#012available auto cold 0 ve4b4eeac-2cb5-4c64-801f-72290d038c3f#012available auto cold 0 vdc2e10bd-64c8-48e2-8f44-f1df407b8db1#012available auto cold 0 v760936a3-203c-4502-8279-c14f5cfc6043#012available auto cold 0 v3d92cc63-1e8d-48a1-9807-61b8a8d05a1d#012available auto cold 0 vd769414b-d542-4049-b43b-0a01a3386d9e#012available auto cold 0 vdc495745-20f7-42cb-be29-0387db5af1b8#012available auto cold 0 vf6e6bdc5-257b-4e06-967f-18646956388e#012available auto cold 0 vd1001f8e-3590-41ed-91fb-681e5ed51b28#012available auto cold 0 va912a70e-eb23-420c-b48c-45a46c5b843d#012available auto cold 0 v6f4358b5-2299-456b-a3e7-05e4da777bb1#012available auto cold 0 v8f2fb3e7-aeb7-441f-9013-4aa33c96c3e0#012available auto cold 0 v6f5036e1-1d7d-43ad-a9cb-1b02ea7684cf#012available auto cold 0 v6da104e1-57e0-437c-80f0-031feee9e81b#012available auto cold 0 v169b9b09-8200-44a7-bd4c-ccbecfa0bab3#012available auto cold 0 v4fdaf50b-fee 00:29:32 varnishd CLI telnet 127.0.0.1 18856 127.0.0.1 7090 Rd vcl.discard v46927630-8bd4-4024-b3b2-9d32216df178 00:29:32 varnishd Child (17388) not responding to CLI, killed it.

After I've made /var/lib/varnish huge - there are no such crashes, just ~1GB eaten for the first day having 100+ old VCLs.

I've made complete OS reinstall to be sure that it is not some crazy old library, but everything is still the same. Hope I'll provide something useful further when succeed in dealing with core dump.

nigoroll commented 5 years ago

@acachy this makes perfect sense. I was already suspecting that we would run be running into an out-of-space condition, but I was wondering why we would not see errno set for the short write in VSMW_DestroyCluster() / vsmw_delseg() in the panic https://github.com/varnishcache/varnish-cache/issues/2764#issuecomment-482455093 But I simply overlooked the fact that the last (short) write succeeds and only a follow up attempt triggers an ENOSPC as demonstrated easily on a small filesystem:

# truss dd if=/dev/zero of=t bs=123456
...
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 123456) = 123456
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 123456) = 123456
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 123456) = 60928
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 62528) = -1 ENOSPC (No space left on device)

So is the following an appropriate summary from your point of view?

Users must ensure that the varnish working directory has sufficient space available
Varnish should not panic for ENOSPC
We should work on optimizing runtime with many backends
You are suspecting an issue with VCL references piling up. This could be a leak, but also could just be normal, at any rate, we lack visibility

Is there anything missing in the list?

nigoroll commented 5 years ago

More Ideas:

for visibility, we could maybe have a vcl.explainref command or similar explaining the references which hold a vcl
you mentioned dynamic backends in the initial note. Are you talking about https://github.com/nigoroll/libvmod-dynamic ? If so, does changing domain_usage_timeout change anything
We should have the option to Connection: close client connections on the non-active VCL

broader idea (sort of unrelated)

We should actually allow setting the client timeouts (idle_send_timeout, send_timeout, timeout_idle, timeout_linger) per session from VCL. This would allow to tune timeouts based on how trustworthy one particular client is (e.g. anonymous client vs. authenticated vs. CDN servers).

acachy commented 5 years ago

Varnish should not panic for ENOSPC

For all cases, including vcl.discard and vcl.load.

for visibility, we could maybe have a vcl.explainref command or similar explaining the references which hold a vcl

Suddenly my "strange" server has stopped having stucked VCLs at all and looks like this rare bug wouldn't be investigated now. However, your solution with forced "Connection: close" might be a complete fix for this cases. So, this additional debug-feature may not be useful.

We should work on optimizing runtime with many backends

Specifying - vcl.discard and varnishstat performance. Other things are OK for many backends configuration.

you mentioned dynamic backends in the initial note. Are you talking about https://github.com/nigoroll/libvmod-dynamic

Initial note is not mine. I use only default modules (directors, std) and ~5k normal backends.

Extra: correct parallel execution of all CLI commands, including

varnishstat during reloads
crossing reload\discard tasks At least it shouldn't cause bogus CLI timeout.

bsdphk commented 5 years ago

@acachy do you have TCP_KEEPALIVE configured on these servers ?

As far as I know, the only way VCL's can be held for a long time is of worker threads are stuck waiting on clients which have gone away but never send a packet to say so. TCP_KEEPALIVE should fix that.

As for reference counts increasing after vcl.discard, that can happen with ESI requests in-flight when the vcl.discard is executed, but I know of no other cases where that should happen.

acachy commented 5 years ago

@bsdphk

As for reference counts increasing after vcl.discard, that can happen with ESI requests in-flight when the vcl.discard is executed, but I know of no other cases where that should happen.

ESI is not in use

As far as I know, the only way VCL's can be held for a long time is of worker threads are stuck waiting on clients which have gone away but never send a packet to say so. TCP_KEEPALIVE should fix that.

TCP_KEEPALIVE settings = default (75-9-7200)
I have tcp_fastopen=on
The problem was caused by 100+ remote monitoring servers: 1) Each one had made requests each few seconds 2) Each one could die on any possible connection state

So it is very unusual case, but anyway there should be some bugs around.

mbgrydeland commented 5 years ago

There's been numerous updates to the VSM code to improve performance in the latest Varnish release. It is suspected that that will have improved the situation, and I will close this ticket.

varnishcache / varnish-cache