varnishcache / varnish-cache

Varnish Cache source code repository
https://www.varnish-cache.org
Other
3.68k stars 377 forks source link

High file count in /var/lib/varnish/HOST/. (6.0.0) #2712

Closed lkarsten closed 5 years ago

lkarsten commented 6 years ago

Issue: On 6.0.0 from packagecloud, varnishstat in ncurses and varnishstat -j -1 is running slowly and using a lot of CPU.

Expected: varnishstat -1 -j return almost instantly.

# time strace -o foo2 -s500 varnishstat -1 -j 1>/dev/null

real    9m19.627s
user    9m16.137s
sys 0m2.975s

# grep openat foo2 | wc -l
1122

# find /var/lib/varnish/HOST/ -type d | wc -l
1076

This is a system that does automatic VCL reloads, and has maybe ~100 backends.

According to the strace output, it spends 150-200ms for each of the many files in /var/lib/varnish/HOST/:

13:38:02.979966 openat(AT_FDCWD, "/var/lib/varnish/HOST//_.vsm_child/_.VSC_cluster.2c2227ce548d18c4", O_RDONLY) = 8
13:38:02.980075 mmap(NULL, 12288, PROT_READ, MAP_SHARED, 8, 0) = 0x7f7a7c895000
13:38:02.980118 close(8)                = 0
13:38:03.018232 brk(0xa3bb000)          = 0xa3bb000
13:38:03.200755 brk(0xa3dc000)          = 0xa3dc000
13:38:03.264159 openat(AT_FDCWD, "/var/lib/varnish/HOST//_.vsm_child/_.VSC_cluster.14f6762e7457da35", O_RDONLY) = 8
13:38:03.264291 mmap(NULL, 12288, PROT_READ, MAP_SHARED, 8, 0) = 0x7f7a7c892000
13:38:03.264339 close(8)                = 0
13:38:03.419321 brk(0xa3fd000)          = 0xa3fd000
13:38:03.595609 openat(AT_FDCWD, "/var/lib/varnish/HOST//_.vsm_child/_.VSC_cluster.0769b9bc62c9b5f8", O_RDONLY) = 8
13:38:03.595749 mmap(NULL, 12288, PROT_READ, MAP_SHARED, 8, 0) = 0x7f7a7c88f000
13:38:03.595811 close(8)                = 0
13:38:03.657468 brk(0xa41e000)          = 0xa41e000
13:38:03.919323 brk(0xa43f000)          = 0xa43f000
13:38:03.998852 openat(AT_FDCWD, "/var/lib/varnish/HOST//_.vsm_child/_.VSC_cluster.5d1404c31bc1e73e", O_RDONLY) = 8
13:38:03.998966 mmap(NULL, 12288, PROT_READ, MAP_SHARED, 8, 0) = 0x7f7a7c88c000
13:38:03.999009 close(8)                = 0
13:38:04.235989 brk(0xa460000)          = 0xa460000

Another oddity on this system is that "varnishadm vcl.list" is empty, although there are a massive amount of VCLs loaded. It returns 200, so supposedly not cli_buffer. I'll debug that and report a separate issue.

lkarsten commented 6 years ago

[10:28:00] < phk> scn, ping ? What is a "high file count" numerically ? 1000 ? 10000 ? 100000 ?

There were around 1000 directories in /var/lib/varnish/HOST/. If I remember correctly, each directory had a vgc.so file in it.

The configuration system was producing failing VCL (vmod package was not installed) at the time this happened. It was either failing varnishd -C -f foo.vcl or VCL load (via systemctl reload varnish), every ~60s for at least 24 hours.

There is perhaps an underlaying bug here that these files are left in place, in addition to the user incomprehensible sudden slowness of the logging tools/vsc clients.

bsdphk commented 6 years ago

Failing VCL compiles should leave no trace in the filesystem, if it does, that's a bug.

bsdphk commented 6 years ago

Close this for now, new tickets will be opened if there are compile-failure residuals.

lkarsten commented 5 years ago

Hi.

Today this problem was back on a 6.0.1 rig.

# time varnishstat -1 1>/dev/null

real    0m0.768s
user    0m0.739s
sys 0m0.028s
# find /var/lib/varnish  -name vgc.so | wc -l
50
# varnishstat  -V
varnishstat (varnish-6.0.1 revision 8d54bec5330c29304979ebf2c425ae14ab80493c)
Copyright (c) 2006 Verdens Gang AS
Copyright (c) 2006-2015 Varnish Software AS

I went through the commits in git master from Aug 6 and a month back, but couldn't find a fix.

This isn't a big deal right now, but for anyone that is using Telegraf for monitoring you quickly run into https://github.com/influxdata/telegraf/blob/master/plugins/inputs/varnish/varnish.go#L81 , where a hardcoded 200ms timeout kills off all data collection from Varnish. That is course something for the telegraf maintainers, but as mentioned earlier in this issue, in some cases varnishstat -1 can take minutes.

There is some underlaying issue here that should be tended to at some point.

lkarsten commented 5 years ago

No, this is just user error.

I seem to have 50 VCLs loaded, so naturally there are 50 vgc.so files. I will now consult the VCL cleanup script and ask it nicely to do what it is supposed to do.

Sorry for the noise.