varnish / varnish-modules

Collection of Varnish Cache modules (vmods) by Varnish Software

update docs to reflect potential risk regarding Global variables after Varnish child panic #190

Closed: matheuzzy closed this issue 3 years ago

matheuzzy commented 3 years ago

Hello!

Consider the following scenario: implement a maintenance open/close mechanism for a load balancer (Varnish) using VMOD var. Be sure to keep the state after a Varnish restart/reload!

Solution:

#!/usr/bin/env bash

set -e

echo "Write VCL lb mode"
echo "Set $1 in /var/www/lb_mode.vcl"

echo "sub vcl_init {
    var.global_set(\"lb_mode\", \"$1\");
}" > /var/www/lb_mode.vcl

echo "CURL local varnish to set new lb_mode in global var"
curl "http://127.0.0.1:6081/set_lb_mode/$1"

The issue occurs after a Varnish child panic for some (other) reason: the global var value is lost, and the vcl_init routine is not called when the child restarts.

Varnish Test Case:

varnishtest "VCL: check global value after panic"

barrier b1 cond 2
barrier b2 cond 2

server s1 {
    rxreq
    txresp -hdr "Foo: bar" -body "abcdef\n"
    rxreq
    txresp -hdr "Panic: fetch" -body "012345\n"
    close

    barrier b1 sync
    accept
    rxreq
    txresp -hdr "Foo: bar" -body "abcdef\n"
    rxreq
    txresp -hdr "Panic: deliver" -body "012345\n"
    close

    barrier b2 sync
    accept
    rxreq
    txresp -hdr "Foo: foo" -body "abcdef\n"
} -start

varnish v1 -arg "-sdefault,1m" -vcl+backend {
    import vtc;
    import std;
    import var;
    sub vcl_init {
        var.global_set("load_balancer_mode","open");
        std.log("Init only once!");
    }
    sub vcl_recv {

        std.log(req.url);
        if (req.url ~ "^/set_lb_mode/") {
            var.global_set("load_balancer_mode", regsub(req.url, "^/set_lb_mode/(.*)$", "\1"));
            std.log("SET global var: " + var.global_get("load_balancer_mode"));
        }
    }

    sub vcl_backend_response {
        if (beresp.http.panic == "fetch") {
            vtc.panic("Had Panic header: " + beresp.http.panic);
        }
    }
    sub vcl_deliver {
        set resp.http.load_balancer_mode = var.global_get("load_balancer_mode");
        if (resp.http.panic == "deliver") {
            vtc.panic("Had Panic header: " + resp.http.panic);
        }
    }
} -start

varnish v1 -cliok "stop"
varnish v1 -cliok "start"
varnish v1 -wait-running
varnish v1 -expect MGT.child_panic == 0
varnish v1 -cliok "param.set feature +no_coredump"

client c1 {
    txreq -url "/set_lb_mode/closed"
    rxresp
    expect resp.http.load_balancer_mode == "closed"
    txreq -url "/foo"
    # Don't expect answer, the server crashed.
} -run

varnish v1 -wait-stopped
varnish v1 -cliok "panic.show"
varnish v1 -clijson "panic.show -j"
varnish v1 -cliok "panic.clear"
varnish v1 -expect MGT.child_panic == 1
varnish v1 -clierr 300 "panic.clear"
varnish v1 -cliok "start"
varnish v1 -wait-running
barrier b1 sync

client c1 {
    txreq -url "/"
    rxresp
    expect resp.http.load_balancer_mode == "closed"
    txreq -url "/foo"
    # Don't expect answer, the server crashed.
} -run

varnish v1 -wait-stopped
varnish v1 -cliok "panic.show"
varnish v1 -clijson "panic.show -j"
varnish v1 -cliok "panic.clear -z"
varnish v1 -expect MGT.child_panic == 0
varnish v1 -clierr 300 "panic.clear"
varnish v1 -cliok "start"
varnish v1 -wait-running
barrier b2 sync

client c1 {
    txreq -url "/"
    rxresp
    expect resp.http.load_balancer_mode == "closed"
    expect resp.http.foo == "foo"
} -run

varnish v1 -cliok "panic.clear -z"
varnish v1 -expectexit 0x20

PS: I didn't find a way to demonstrate the overwriting of the vcl_init part, but it is surely not re-read, since std.log("Init only once!") shows up just once.

gquintard commented 3 years ago

Hi,

I'm probably missing something, but it seems like an overly contrived way of saying that if you panic, Varnish restarts, losing all the global variables. And I feel there's already a warning about this here: https://github.com/varnish/varnish-modules/blob/master/src/vmod_var.vcc#L8 Or am I missing something?

If you want to keep state across restarts, you can however use a file on disk or a vmod like vmod_redis.
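
Something like this sketch for the file-on-disk variant (untested; /var/www/lb_mode is a made-up path that your maintenance script would write "open" or "closed" into):

import std;
import var;

sub vcl_init {
    # vcl_init runs every time this compiled VCL is loaded into a
    # child, including after a panic, and std.fileread's cache lives
    # in child memory, so a restarted child reads the file again
    var.global_set("lb_mode", std.fileread("/var/www/lb_mode"));
}

Keep in mind that std.fileread caches the content, so a running child won't pick up edits to the file until the VCL is reloaded (or the child restarts).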

matheuzzy commented 3 years ago

Hello! Thanks for your reply.

My point is that you have no option to keep that state if the Varnish child panics.

If you store it in the global var, sure, it will be lost. But if you re-write the VCL dynamically (use a file, as you said) and update the vcl_init routine, it will not be read on child restart. It will only be read if the Varnish main service is restarted/reloaded, and a child panic does not make the main service do so. (In my vtc, std.log("Init only once!"); shows only once.)

Or maybe the information I was missing is that the VCL is not re-read from disk on a child panic. I thought that after a panic the VCL is loaded again and vcl_init is called again ("Called when VCL is loaded, before any requests pass through it. Typically used to initialize VMODs.").

gquintard commented 3 years ago

so, there seems to be (have been) some confusion on the VCL lifecycle, but basically:

- you start varnish with a VCL file; the manager compiles it into, say, VCL1, the child loads it and runs vcl_init
- you edit the file on disk; nothing happens, varnish doesn't look at it again
- the child panics and dies
- the varnish manager pulls the previously active VCL, VCL1 and runs vcl_init (the already-compiled object, not the file on disk)

(basically what you said, but it could be useful for future readers)

One way to reason about it is that until you've actually compiled the file, Varnish doesn't know anything about it; you could have changed /etc/varnish/default.vcl or /tmp/foo.vcl, it doesn't matter to varnish.
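
To make that concrete, the usual reload dance through varnishadm looks roughly like this (lb_v2 is an arbitrary VCL name):

# the file is read and compiled only at this point; the child runs
# vcl_init when the new VCL is loaded
varnishadm vcl.load lb_v2 /etc/varnish/default.vcl

# switch traffic to the freshly compiled VCL
varnishadm vcl.use lb_v2

# edits to default.vcl from now on are invisible to varnish until
# the next vcl.load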

I don't think it warrants a docfix here, but the upstream docs would benefit from some clarification if you feel that part was missing.

By the way, I know panics will happen, but they shouldn't; was this ticket triggered by an actual panic you observed, or is it purely theoretical?

matheuzzy commented 3 years ago

Thanks for making the scenario more clear!

the varnish manager pulls the previously active VCL, VCL1 and runs vcl_init

^ I'm not sure that vcl_init is called at all, because I don't see any output from std.log in the VTC run; or maybe the custom-defined one is not executed (just the built-in one)?

> varnishtest panic.vtc -v | grep VCL_Log
**** v1    0.3 vsl|          0 VCL_Log         - Init only once!
**** v1    1.7 vsl|          2 VCL_Log         c /set_lb_mode/closed
**** v1    1.7 vsl|          2 VCL_Log         c SET global var: closed
**** v1    3.0 vsl|          2 VCL_Log         c /
> 

Yes, for me, some clarification in the vcl_init docs would have helped to protect against this case. Once I found it, the solution was very simple: change the VCL and reload the varnish service. It is more convenient and consistent than global vars. Maybe a global var is useful when its value is updated often and you don't care about losing it; a counter for a particular case, maybe.
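
For reference, a sketch of that approach, reusing the lb_mode.vcl include from my first post (and assuming a systemd unit whose reload action maps to the varnishreload script):

#!/usr/bin/env bash

set -e

# persist the mode in the included VCL file
echo "sub vcl_init {
    var.global_set(\"lb_mode\", \"$1\");
}" > /var/www/lb_mode.vcl

# recompile and reload the VCL so vcl_init runs with the new value
systemctl reload varnish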

The main info that messed me up is the following:

  1. vcl_init .. "Typically used to initialize VMODs"
  2. vmod var: "Global variables have a lifespan .. as long as the vmod is loaded"

So, since vcl_init is the place where vmods are initialized, and the var is kept for as long as the vmod is loaded, it makes you think you have covered all the cases. And of course, the naive test is to restart/reload varnish, and it works perfectly :)

For sure this is a corner case, and I know Varnish has an impressive way of handling failures (that's why panic was invented, and so on) and that's why it has a complex VCL lifecycle, but for me (and maybe other VCL devs) this info might be useful. So this is more like feedback, not a real issue.

By the way, what do you think about the example with "global variables as state flags" in the docs? Maybe some extra info about how to persist its value across a restart (or panic) would be helpful? Then maybe no one else will go with a wrong solution similar to mine.

This comes from a real-life incident. We use this mechanism for an important business flow: keeping some info on the website hidden from our competition for a time interval (5-6 hours, while updates & tests are made to the website), then opening it to the audience.

The panic was produced by a brutal ntpd update that caused a "Clock step detected":

Oct 5 01:59:11 ***-prod ntpd[64655]: 0.0.0.0 061c 0c clock_step -18.563490 s
Oct 5 01:59:11 ***-prod ntpd[64655]: 0.0.0.0 0615 05 clock_sync
Oct 5 01:59:11 ***-prod systemd: Time has been changed
Oct 5 01:59:11 ***-prod kernel: cache-worker[81396]: segfault at 28 ip 000000000043b0ee sp 00007fdd018d3100 error 4 in varnishd[400000+115000]
Oct 5 01:59:13 ***-prod ntpd[64655]: 0.0.0.0 c618 08 no_sys_peer
Oct 5 01:59:13 ***-prod varnishd[68116]: Child (68126) died signal=11
Oct 5 01:59:13 ***-prod varnishd[68116]: Child (68126) Panic at: Mon, 04 Oct 2021 22:59:13 GMT#012Wrong turn at cache/cache_session.c:544:#012Clock step detected#012version = varnish-6.0.5 revision 204a927f4a4283529fc89f5182fe8cc3f2d0f617, vrt api = 7.1#012ident = Linux,3.10.0-1160.31.1.el7.x86_64,x86_64,-junix,-smalloc,-sdefault,-hcritbit,epoll#012now = 9035617.928672 (mono), 1633388351.666667 (real)#012Backtrace:#012 0x43b47e: /usr/sbin/varnishd() [0x43b47e]#012 0x49ca82: /usr/sbin/varnishd(VAS_Fail+0x42) [0x49ca82]#012 0x444748: /usr/sbin/varnishd() [0x444748]#012 0x462d90: /usr/sbin/varnishd() [0x462d90]#012 0x458d63: /usr/sbin/varnishd() [0x458d63]#012 0x459220: /usr/sbin/varnishd() [0x459220]#012 0x7fdd435a0ea5: /lib64/libpthread.so.0(+0x7ea5) [0x7fdd435a0ea5]#012 0x7fdd432c99fd: /lib64/libc.so.6(clone+0x6d) [0x7fdd432c99fd]#012thread = (cache-worker)#012thr.req = 0x7fd486e30020 {#012 MAGIC at 0x7fd486e30020 is 0x00000000 (Should be: REQ_MAGIC/0x2751aaa1)#012 vxid = 0, transport = NULL#012 step = 0x0,#012 req_body = R_BODY_INIT,#012 restarts = 0, esi_level = 0,#012 ws = 0x7fd486e30170 {#012 MAGIC at 0x7fd486e30170 is 0x00000000 (Should be: WS_MAGIC/0x35fac554)#012 id = \"\",#012 {s, f, r, e} = {(nil), +0, +0, +0},#012 },#012 http[req] = (nil) {#012 },
Oct 5 01:59:13 ***-prod varnishd[68116]: Child (138733) said Child starts
gquintard commented 3 years ago

^ I'm not sure that vcl_init is called at all because I don't see any output from std.log in the VTC run; Or maybe the custom defined one is not executed (just the build-in one)..?

Ye of little faith :-) When a panic occurs, the varnish child's memory is gone, so you lose your cache of course, but also the var state, variables and so on; if we didn't go through vcl_init again, there's no way access to var would work at all.

Let's see if I can convince you that we do indeed go through vcl_init with this vtc:

varnishtest "panic"

server s1 {} -start

# initial state on disk
shell {
    echo foo > ${tmpdir}/state
}

varnish v1 \
    -arg "-p auto_restart=on" \
    -arg "-p feature=+no_coredump" \
    -vcl+backend {
    import std;
    import vtc;

    sub vcl_init {
        std.log("vcl_init:" + std.fileread("${tmpdir}/state"));
    }

    sub vcl_recv {
        vtc.panic("boom");
    }
} -start

# change the state file while the child is running
shell {
    echo bar > ${tmpdir}/state
}

# kill the worker; auto_restart=on brings it back up
varnish v1 -clierr 400 debug.panic.worker
delay 1
varnish v1 -cliok "panic.clear"

# the log backlog shows vcl_init ran again and re-read the file
shell {
    varnishlog -g raw -d -n ${v1_name} | grep "vcl_init:bar"
}

varnish v1 -expectexit 0x40

The important steps are:

- the state file contains "foo" when the VCL is first loaded, so the log shows "vcl_init:foo"
- the file is changed to "bar" while the child is running, which varnish never notices
- debug.panic.worker kills the child, and auto_restart=on brings it back up
- the restarted child loads the same compiled VCL again, vcl_init runs again, and since std.fileread's cache died with the old child, the file is re-read and the log shows "vcl_init:bar"

The reason you didn't see the second init in your vtc is probably that when the panic occurs there's a gap in the log reading and you lose a few lines, but with varnishlog -d you can read the backlog and see that the lines were indeed written.

By the way, what do you think about the example with "global variables as state flags" in the docs? maybe an extra info about how to store it's value after restart (or panic) is helpful? maybe no one else will go with a wrong similar solution as mine.

I'm honestly open to PRs; you got tripped up by this one, so your experience is relevant. However, be careful not to muddy the water too much: describing the full solution would probably double the current amount of documentation. So maybe just a line mentioning "you should also take care of persisting the information in case of a restart/panic" would be enough?

By the way, if persisting the data is important then, again, use redis, sqlite3, memcached and the associated vmods.
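
For example, with vmod_redis the flag can live outside the child entirely; a rough sketch following that vmod's README (the address is made up, and the exact names may differ between versions, so double-check):

import redis;

sub vcl_init {
    # a single connection to the redis instance holding the
    # authoritative lb_mode value
    new db = redis.db(
        location="127.0.0.1:6379",
        type=master,
        connection_timeout=500,
        shared_connections=false,
        max_connections=1);
}

sub vcl_recv {
    # the flag survives any varnish panic because it never lived in
    # child memory to begin with
    db.command("GET");
    db.push("lb_mode");
    db.execute();
    if (db.reply_is_string() && db.get_string_reply() == "closed") {
        return (synth(503, "Maintenance"));
    }
}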

matheuzzy commented 3 years ago

Thanks for your time @gquintard! 👍
Since I found varnishtest I have learned a lot, and I trust it too much, I think :)

Your VTC proved the vcl_init usage, plus one more thing: the content from std.fileread is not stored by the Varnish manager, so this is a perfect solution to persist state after a panic. (Funny thing is that I had this solution with std.fileread, but I changed it a while ago...)