matheuzzy closed this issue 3 years ago
Hi,
I'm probably missing something, but it seems like an overly contrived way of saying that if you panic, Varnish restarts, losing all the global variables. And I feel there's already a warning about this here: https://github.com/varnish/varnish-modules/blob/master/src/vmod_var.vcc#L8 Or am I missing something?
If you want to keep state across restarts, you can however use a file on disk or a vmod like vmod_redis.
Hello! Thanks for your reply.
My point is that you have no option to keep that state if the Varnish child panics.
If you store it in a global var -> sure, it will be lost. But if you re-write the VCL dynamically (using a file, as you said) and update the vcl_init routine, it will not be re-read on a child restart. It is only read when the main Varnish service is restarted/reloaded, and a child panic does not make the main service do that. (In my vtc, std.log("Init only once!") shows only once.)
Or maybe the information I was missing is that the VCL is not re-read from disk on a child panic. I thought that after a panic the VCL is loaded again and vcl_init is called again ("Called when VCL is loaded, before any requests pass through it. Typically used to initialize VMODs.")
so, there seems to be (have been) some confusion on the VCL lifecycle, but basically:
- every time a VCL is loaded into the child, vcl_init is run
- when the child restarts after a panic, the manager pushes the previously active, already compiled VCL into the new child, so vcl_init is run again
- the file on disk is only read when the VCL is compiled; after that it's never read/compiled/stored again
(basically what you said, but it could be useful for future readers)
One way to reason about it is that until you've actually compiled the file, Varnish doesn't know anything about it, and you could have changed /etc/varnish/default.vcl or /tmp/foo.vcl, it doesn't matter to varnish.
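To make that concrete, a quick illustration with the manager CLI (a hypothetical session; vcl.load and vcl.use are the standard varnishadm commands):

# the file is read and compiled once, right now:
varnishadm vcl.load v2 /etc/varnish/default.vcl
varnishadm vcl.use v2
# from this point on, editing /etc/varnish/default.vcl changes nothing;
# a restarting child gets the stored, compiled v2, not the file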
I don't think it warrants a docfix here, but the upstream docs would benefit from some clarification if you feel that part was missing.
By the way, I know panics will happen, but they shouldn't. Was that ticket triggered by an actual panic you observed, or is it purely theoretical?
Thanks for making the scenario more clear!
the varnish manager pulls the previously active VCL, VCL1 and runs vcl_init
^ I'm not sure that vcl_init is called at all, because I don't see any output from std.log in the VTC run; or maybe the custom-defined one is not executed (just the built-in one)..?
> varnishtest panic.vtc -v | grep VCL_Log
**** v1 0.3 vsl| 0 VCL_Log - Init only once!
**** v1 1.7 vsl| 2 VCL_Log c /set_lb_mode/closed
**** v1 1.7 vsl| 2 VCL_Log c SET global var: closed
**** v1 3.0 vsl| 2 VCL_Log c /
>
Yes, for me, some clarification in the vcl_init docs would have helped to protect against this case. After I found this case, the solution was very simple: change the VCL and reload the varnish service. It is more convenient and consistent than global vars. Maybe a global var is useful when you have many updates on its value and don't care if you lose it; maybe a counter for a particular case.
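For illustration, a minimal sketch of that approach (hypothetical; the header name and token are made up): the flag lives in the compiled VCL itself, so a child panic cannot lose it, since the manager re-pushes the same compiled VCL into the new child. To flip the flag, edit the file and vcl.load/vcl.use (or reload the service).

vcl 4.1;

backend default none;

sub vcl_recv {
    # "closed" mode is a compile-time constant in this VCL: a child
    # panic cannot lose it, because the manager reloads this exact
    # compiled VCL into the new child
    if (req.http.X-Maintenance-Token != "secret-token") {
        return (synth(503, "Maintenance"));
    }
    return (synth(200, "OK"));
}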
The main info that messed me up is the following:
So since vcl_init is the place where vmods are initialized, and the var is kept for as long as the vmod is loaded... it makes you think you have covered all the cases. And of course, the naive test is to restart/reload varnish, and it will work perfectly :)
For sure this is a corner case, and I know Varnish has an impressive way to handle failures (that's why panic was invented and so on) and that's why it has a complex VCL lifecycle, but for me (and maybe other VCL devs) this info might be useful. So this is more like feedback, not a real issue.
By the way, what do you think about the example with "global variables as state flags" in the docs? Maybe some extra info about how to store its value across a restart (or panic) would be helpful? Maybe then no one else will go with a wrong solution similar to mine.
This comes from a real-life incident: we use this mechanism for an important business flow, to keep some info on the website hidden from our competition for a time interval (5-6 hours, while updates & tests are made to the website) and open it to the audience afterwards.
The panic was produced by a brutal ntpd update that caused a "Clock step detected":
Oct 5 01:59:11 ***-prod ntpd[64655]: 0.0.0.0 061c 0c clock_step -18.563490 s
Oct 5 01:59:11 ***-prod ntpd[64655]: 0.0.0.0 0615 05 clock_sync
Oct 5 01:59:11 ***-prod systemd: Time has been changed
Oct 5 01:59:11 ***-prod kernel: cache-worker[81396]: segfault at 28 ip 000000000043b0ee sp 00007fdd018d3100 error 4 in varnishd[400000+115000]
Oct 5 01:59:13 ***-prod ntpd[64655]: 0.0.0.0 c618 08 no_sys_peer
Oct 5 01:59:13 ***-prod varnishd[68116]: Child (68126) died signal=11
Oct 5 01:59:13 ***-prod varnishd[68116]: Child (68126) Panic at: Mon, 04 Oct 2021 22:59:13 GMT#012Wrong turn at cache/cache_session.c:544:#012Clock step detected#012version = varnish-6.0.5 revision 204a927f4a4283529fc89f5182fe8cc3f2d0f617, vrt api = 7.1#012ident = Linux,3.10.0-1160.31.1.el7.x86_64,x86_64,-junix,-smalloc,-sdefault,-hcritbit,epoll#012now = 9035617.928672 (mono), 1633388351.666667 (real)#012Backtrace:#012 0x43b47e: /usr/sbin/varnishd() [0x43b47e]#012 0x49ca82: /usr/sbin/varnishd(VAS_Fail+0x42) [0x49ca82]#012 0x444748: /usr/sbin/varnishd() [0x444748]#012 0x462d90: /usr/sbin/varnishd() [0x462d90]#012 0x458d63: /usr/sbin/varnishd() [0x458d63]#012 0x459220: /usr/sbin/varnishd() [0x459220]#012 0x7fdd435a0ea5: /lib64/libpthread.so.0(+0x7ea5) [0x7fdd435a0ea5]#012 0x7fdd432c99fd: /lib64/libc.so.6(clone+0x6d) [0x7fdd432c99fd]#012thread = (cache-worker)#012thr.req = 0x7fd486e30020 {#012 MAGIC at 0x7fd486e30020 is 0x00000000 (Should be: REQ_MAGIC/0x2751aaa1)#012 vxid = 0, transport = NULL#012 step = 0x0,#012 req_body = R_BODY_INIT,#012 restarts = 0, esi_level = 0,#012 ws = 0x7fd486e30170 {#012 MAGIC at 0x7fd486e30170 is 0x00000000 (Should be: WS_MAGIC/0x35fac554)#012 id = \"\",#012 {s, f, r, e} = {(nil), +0, +0, +0},#012 },#012 http[req] = (nil) {#012 },
Oct 5 01:59:13 ***-prod varnishd[68116]: Child (138733) said Child starts
^ I'm not sure that vcl_init is called at all, because I don't see any output from std.log in the VTC run; or maybe the custom-defined one is not executed (just the built-in one)..?
Ye of little faith :-) When a panic occurs, the varnish child memory is gone, so you lose your cache of course, but also the var state, variables and so on; so if you don't go through vcl_init, there's no way access to var will work. Let's see if I can convince you that we indeed go through vcl_init with this vtc:
varnishtest "panic"
server s1 {} -start
shell {
echo foo > ${tmpdir}/state
}
varnish v1 \
-arg "-p auto_restart=on" \
-arg "-p feature=+no_coredump" \
-vcl+backend {
import std;
import vtc;
sub vcl_init {
std.log("vcl_init:" + std.fileread("${tmpdir}/state"));
}
sub vcl_recv {
vtc.panic("boom");
}
} -start
shell {
echo bar > ${tmpdir}/state
}
varnish v1 -clierr 400 debug.panic.worker
delay 1
varnish v1 -cliok "panic.clear"
shell {
varnishlog -g raw -d -n ${v1_name} | grep "vcl_init:bar"
}
varnish v1 -expectexit 0x40
The important steps are:
- we write foo in the state file and start varnish: vcl_init logs "vcl_init:foo"
- we change the state file to bar, then panic the worker
- the manager restarts the child with the same compiled VCL, vcl_init runs again, re-reads the file and logs "vcl_init:bar"
The reason why you didn't see the second init in your vtc is probably that when the panic occurs, there's a gap in the log reading and you lose a few lines, but with varnishlog -d you can backlog and see that the lines were indeed written.
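For reference, the backlog check in the vtc boils down to a session like this (output is illustrative, in the raw-grouping format shown earlier; vcl_init records carry vxid 0):

> varnishlog -g raw -d -n ${v1_name} | grep "vcl_init:"
0 VCL_Log - vcl_init:foo
0 VCL_Log - vcl_init:bar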
By the way, what do you think about the example with "global variables as state flags" in the docs? Maybe some extra info about how to store its value across a restart (or panic) would be helpful? Maybe then no one else will go with a wrong solution similar to mine.
I'm honestly open to PRs; you got tripped up by that one, so your experience is relevant. However, be careful not to muddy the waters too much: describing the full solution would probably take double the current amount of documentation. So maybe just a line mentioning "you should also take care of persisting the information in case of a restart/panic" would be enough?
By the way, if persisting the data is important, again: use redis, sqlite3 or memcached and the associated vmods.
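As a rough sketch of the redis option (based on the documented command/push/execute API of carlosabalde's libvmod-redis; the constructor's parameter list varies between versions, and the lb_mode key is made up):

vcl 4.1;

import redis;
import var;

backend default none;

sub vcl_init {
    # one handle to a single redis master; tune timeouts/pooling as needed
    new db = redis.db(
        location="127.0.0.1:6379",
        type=master,
        connection_timeout=500,
        shared_connections=false,
        max_connections=2);

    # re-seed the flag from redis on every (re)load, panics included
    db.command("GET");
    db.push("lb_mode");
    db.execute();
    var.global_set("lb_mode", db.get_string_reply());
}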
thanks for your time @gquintard! 👍
Since I found varnishtest I've learned a lot, and I trust it too much, I think :)
Your VTC proved the vcl_init usage + one more thing: the content from std.fileread is not stored by the Varnish manager, so this is a perfect solution to persist state after a panic. (The funny thing is that I had this solution with std.fileread, but I changed it a while ago...)
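For future readers, the resulting pattern is roughly this (a minimal sketch; the file path and var name are placeholders, and something outside Varnish must write the state file, since VCL itself cannot write to disk):

vcl 4.1;

import std;
import var;

backend default none;

sub vcl_init {
    # vcl_init runs on every child start, including after a panic,
    # so the flag is re-seeded from the file each time
    var.global_set("lb_mode", std.fileread("/etc/varnish/lb_mode"));
}

sub vcl_recv {
    # match instead of compare: std.fileread keeps the trailing newline
    if (var.global_get("lb_mode") ~ "closed") {
        return (synth(503, "Maintenance"));
    }
}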
Hello!
Consider the following scenario: implement a maintenance open/close mechanism for a load balancer (Varnish) using VMOD var. Be sure to keep the state after a Varnish restart/reload!
Solution:
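(The original snippet didn't survive, but judging from the VSL output quoted earlier — "Init only once!", /set_lb_mode/closed, "SET global var: closed" — it was roughly along these lines. A hypothetical reconstruction; the URL scheme and var name are taken from that output:)

vcl 4.1;

import std;
import var;

backend default none;

sub vcl_init {
    # default mode at (re)load time -- this is the line that runs again
    # after a child panic, resetting whatever was set at runtime
    var.global_set("lb_mode", "open");
    std.log("Init only once!");
}

sub vcl_recv {
    # toggle endpoint, e.g. GET /set_lb_mode/closed
    if (req.url ~ "^/set_lb_mode/") {
        var.global_set("lb_mode", regsub(req.url, "^/set_lb_mode/", ""));
        std.log("SET global var: " + var.global_get("lb_mode"));
        return (synth(200));
    }
    if (var.global_get("lb_mode") == "closed") {
        return (synth(403));
    }
}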
The issue occurs after a Varnish child panic for some (other) reason -> the global var value is lost + the vcl_init routine is not called when the child restarts.
Varnish Test Case:
PS: I didn't find a way to demonstrate the overwrite of the vcl_init part, but for sure it is not re-read, since std.log("Init only once!") shows just once.