nigoroll / libvmod-dynamic

The Varnish dns/named director continued
BSD 2-Clause "Simplified" License

Backend seems sick on first access #35

Closed: ghost closed this issue 6 years ago

ghost commented 6 years ago

Hi.

I have Varnish 5.2.1 with vmod dynamic installed from the Yum repositories on a CentOS 7 server, and I'm facing this situation: after a VCL reload, the first call to a site served by a vmod_dynamic director gives me the maintenance page, as if the backend were sick, but the backend is up and running, so simply hitting F5 gives me access to the site.

Maybe a problem with the first DNS lookup? Or maybe I've made some misconfiguration?

What other info do you need to help me with this problem?

Best regards

nigoroll commented 6 years ago

Can you share relevant parts of your configuration and a varnishlog excerpt showing the issue, please? In particular, are you using a probe and, if yes, have you considered the .initial probe attribute?

ghost commented 6 years ago

Ok, I think I'm missing the .initial probe attribute. This is a sample probe:

#probe login, use /login to probe
probe login_probe {
        .url = "/login";
        .interval = 15s;
        .timeout = 3s;
        .window = 5;
        .threshold = 3;
}

Before looking into anything else, I'll try setting .initial, and then, if the issue is not solved, I'll provide all the other info.

Thanks

ghost commented 6 years ago

So, even with .initial set to 5, this does not solve the issue: the first load of a page gives me the service page for an unhealthy backend.

This is the relevant part of the configuration:

#director
new phpipam = dynamic.director(
        share = HOST,
        whitelist = s2,
        port = "8184",
        first_byte_timeout = 10m,
        connect_timeout = 5s,
        between_bytes_timeout = 1m,
        probe = loginslash_probe,
        ttl = 5m
);

#probe loginslash, use /login/ to probe
probe loginslash_probe {
        .url = "/login/";
        .interval = 15s;
        .timeout = 3s;
        .window = 5;
        .threshold = 3;
        .initial = 5;
}

#internal ipam service
if (req.http.host ~ "^ipam.diennea.lan$") {

        include "/etc/varnish/acl/https-301.vcl"; #this is a simple http to https redirect

        set req.backend_hint = phpipam.backend("dockerkube-lb.dna");
}

Unfortunately I'm not finding any call to this service in the logs, and this is really weird because the service is working.

nigoroll commented 6 years ago

varnishlog should always tell you what's happening
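
For example, something along these lines should show the transactions for that host (a sketch, adjust the query to your setup):

varnishlog -g request -q 'ReqHeader:Host ~ "ipam.diennea.lan"'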

ghost commented 6 years ago

So... using varnishlog I see this message:

Anyway, I've captured the relevant part:

*   << BeReq    >> 266160
-   Begin          bereq 266159 pass
-   Timestamp      Start: 1521185882.838929 0.000000 0.000000
-   BereqMethod    GET
-   BereqURL       /login/timeout/
-   BereqProtocol  HTTP/1.1
-   BereqHeader    Host: ipam.diennea.lan
-   BereqHeader    X-Nginx-Forwarded-For: 10.168.1.173
-   BereqHeader    X-Forwarded-Proto: https
-   BereqHeader    X-Server-Ip: 10.168.10.205
-   BereqHeader    user-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0
-   BereqHeader    accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
-   BereqHeader    accept-language: en-US,en;q=0.5
-   BereqHeader    accept-encoding: gzip, deflate, br
-   BereqHeader    referer: https://ipam.diennea.lan/
-   BereqHeader    upgrade-insecure-requests: 1
-   BereqHeader    pragma: no-cache
-   BereqHeader    cache-control: no-cache
-   BereqHeader    cookie: _ga=GA1.2.738750100.1513093686; table-page-size=50; __utma=144425325.738750100.1513093686.1521132450.1521185507.6; __utmz=144425325.1519902084.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmc=144425325; _gid=GA1.2.712278224.1521040902; p
-   BereqHeader    X-Received-Forwarded-For: ::1
-   BereqHeader    X-Forwarded-For: 10.168.1.173
-   BereqHeader    X-Varnish: 266160
-   VCL_call       BACKEND_FETCH
-   VCL_return     fetch
-   BackendOpen    126 reload_2018-03-16T083754.fenoavail 10.168.10.124 8083 10.168.10.207 55182
-   BackendStart   10.168.10.124 8083
-   Timestamp      Bereq: 1521185882.839544 0.000615 0.000615
-   Timestamp      Beresp: 1521185882.840492 0.001562 0.000948
-   BerespProtocol HTTP/1.0
-   BerespStatus   503
-   BerespReason   Service Unavailable
-   BerespHeader   Date: Fri, 16 Mar 2018 07:38:02 GMT
-   BerespHeader   Server: Apache/2.2.15 (CentOS)
-   BerespHeader   Cache-Control: no-cache, no-store
-   BerespHeader   Content-Length: 1364
-   BerespHeader   Connection: close
-   BerespHeader   Content-Type: text/html; charset=UTF-8
-   VCL_call       BACKEND_RESPONSE
-   TTL            VCL 0 0 0 1521185883
-   VCL_return     deliver
-   BerespUnset    Content-Length: 1364
-   BerespHeader   Content-Encoding: gzip
-   BerespHeader   Vary: Accept-Encoding
-   Storage        malloc Transient
-   ObjProtocol    HTTP/1.0
-   ObjStatus      503
-   ObjReason      Service Unavailable
-   ObjHeader      Date: Fri, 16 Mar 2018 07:38:02 GMT
-   ObjHeader      Server: Apache/2.2.15 (CentOS)
-   ObjHeader      Cache-Control: no-cache, no-store
-   ObjHeader      Content-Type: text/html; charset=UTF-8
-   ObjHeader      Content-Encoding: gzip
-   ObjHeader      Vary: Accept-Encoding
-   Fetch_Body     3 length -
-   Gzip           G F E 1364 763 80 6024 6034
-   BackendClose   126 reload_2018-03-16T083754.fenoavail
-   Timestamp      BerespBody: 1521185882.841104 0.002175 0.000612

Maybe this is related to our particular setup: we have a default backend that is set prior to searching for the appropriate host:

    set req.backend_hint=notavailable404.backend();

    include "/etc/varnish/requests/requests.vcl";

In requests.vcl we have the definitions for our vhosts; if a match is found, the appropriate backend is selected. The theory, which worked well from Varnish 2.0 to Varnish 4.1, is that notavailable404 will be used only if none of the vhosts in requests.vcl match the requested host, so I think that with the new Varnish this setup is not working as it did with older versions.

Do you have any hint about this?

Other info: with the Firefox debug tools I see that the first request gives a 503, but this seems like nonsense :|

ghost commented 6 years ago

Some other info: changing the timeouts doesn't solve the problem.

Commenting out set req.backend_hint=notavailable404.backend(); doesn't solve it either.

Other services that use the standard round-robin director aren't affected by the problem.

Another service which uses the dynamic director doesn't show the error page after a reload, but we get kicked out of the session.

Maybe this is an interaction between the director and our setup, but it isn't easy to understand what it is.

nigoroll commented 6 years ago

According to the varnishlog, your backend explicitly sends a 503. I see no connection to vmod_dynamic. I cannot help you unless you show me a varnishlog which shows the issue. There are lots of options to query it specifically, including option -d to use the history still in memory.
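
For instance, something like this should pull the failing transactions out of the in-memory history (a sketch):

varnishlog -d -g request -q 'BerespStatus == 503'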

ghost commented 6 years ago

Yes, the problem is that Varnish doesn't try the right backend the first time and uses fenoavail instead, so this is not related to vmod_dynamic.

I'll close the issue. Thanks for your support.

ghost commented 6 years ago

Sorry, I'm back and I've found where the glitch is, but not why it happens.

For some corporate reasons we have backend servers that serve specific error pages, and we have set it up this way:

default.vcl:

include "/etc/varnish/requests/requests.vcl";
include "/etc/varnish/errpages/conf.vcl";

The errpages/conf.vcl is:


if (!std.healthy(req.backend_hint)) {
        set req.backend_hint = maintswitch.backend();
        if (!std.healthy(req.backend_hint)) {
                if (req.http.X-ErrType == "backend") {
                        unset req.http.X-ErrType;
                        set req.backend_hint = notavailablebe.backend();
                }
                elsif (req.http.X-ErrType == "ducati") {
                        set req.backend_hint = ducatiupdate.backend();
                }
                else {
                        set req.backend_hint = notavailablefe.backend();
                }
        }
        else {
                if (req.http.X-ErrType == "backend") {
                        unset req.http.X-ErrType;
                        set req.backend_hint = maintenancebe.backend();
                }
                elsif (req.http.X-ErrType == "ducati") {
                        set req.backend_hint = ducatiupdate.backend();
                }
                else {
                        set req.backend_hint = maintenancefe.backend();
                }
        }
}

So the problem is that at the first connection

if (!std.healthy(req.backend_hint))

returns true (backend unhealthy) for backends served by vmod_dynamic, and Varnish then correctly uses

notavailablefe.backend();

which returns the 503 seen in varnishlog.

As confirmation, if I comment out

include "/etc/varnish/errpages/conf.vcl";

the first call after a VCL reload works as expected.

I hope I've explained the situation well.

I know it is a complex setup, and I'm actively looking for a better way to achieve what we need.

ghost commented 6 years ago

@nigoroll have you seen my last message? Is there anything you can say about it?

nigoroll commented 6 years ago

It sounds like you would want to set .initial >= .threshold for your probe.
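
For example, applied to the probe from above, that would look roughly like this (a sketch; with .initial >= .threshold the backend counts as healthy right after the VCL load, before any real probe has completed):

probe loginslash_probe {
        .url = "/login/";
        .interval = 15s;
        .timeout = 3s;
        .window = 5;
        .threshold = 3;
        .initial = 3; # >= .threshold
}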

ghost commented 6 years ago

Hmm, I already tried .initial = .threshold with no success; I'll try with > and see if it works.

nigoroll commented 6 years ago

Also check varnishlog, in particular the Backend_health tag. .initial == .threshold should suffice, actually.
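
For example (a sketch; -i restricts the output to the given tag):

varnishlog -g raw -i Backend_health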

ghost commented 6 years ago

So, with .initial == .window it works: no service down after reload. I will do more tests once we are under a lighter workload.

Thanks a lot, and shame on me for not having tried this config before.

ghost commented 6 years ago

Sigh, no, the problem isn't solved.

I've done a Varnish reload and captured some logs.

These are with varnishlog -g raw -q '* ~ vmod-dynamic':

Log abandoned
Log reacquired
Log abandoned
Log reacquired
         0 Timestamp      - vmod-dynamic reload_2018-03-30T092814.phpipam(dockerkube-lb.dna) Lookup: 1522394904.558255 0.000000 0.000000
         0 Timestamp      - vmod-dynamic reload_2018-03-30T092814.phpipam(dockerkube-lb.dna) Results: 1522394904.562455 0.004200 0.004200
         0 Timestamp      - vmod-dynamic reload_2018-03-30T092814.phpipam(dockerkube-lb.dna) Update: 1522394904.568008 0.009752 0.005552

and these with varnishlog -g raw | grep Backend_health | grep -i ipam

         0 Backend_health - reload_2018-03-30T092814.phpipam.dockerkube-lb.dna(10.168.11.51) Still healthy 4--X-RH 5 5 5 0.124471 0.099950 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T092814.phpipam.dockerkube-lb.dna(10.168.11.50) Still healthy 4--X-RH 5 5 5 0.121646 0.104392 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T092814.phpipam.dockerkube-lb.dna(10.168.11.56) Still healthy 4--X-RH 5 5 5 0.135277 0.107116 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T092814.phpipam.dockerkube-lb.dna(10.168.11.52) Still healthy 4--X-RH 5 5 5 0.130498 0.105702 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T091328.phpipam.dockerkube-lb.dna(10.168.11.50) Still healthy 4--X-RH 5 5 5 0.093696 0.107332 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T091328.phpipam.dockerkube-lb.dna(10.168.11.56) Still healthy 4--X-RH 5 5 5 0.120738 0.114492 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T091328.phpipam.dockerkube-lb.dna(10.168.11.52) Still healthy 4--X-RH 5 5 5 0.088463 0.108676 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T091328.phpipam.dockerkube-lb.dna(10.168.11.51) Still healthy 4--X-RH 5 5 5 0.100355 0.113608 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T092814.phpipam.dockerkube-lb.dna(10.168.11.51) Still healthy 4--X-RH 5 5 5 0.107402 0.101813 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T092814.phpipam.dockerkube-lb.dna(10.168.11.50) Still healthy 4--X-RH 5 5 5 0.118870 0.108011 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T092814.phpipam.dockerkube-lb.dna(10.168.11.56) Still healthy 4--X-RH 5 5 5 0.107264 0.107153 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T092814.phpipam.dockerkube-lb.dna(10.168.11.52) Still healthy 4--X-RH 5 5 5 0.123834 0.110235 HTTP/1.1 200 OK
Log abandoned
Log reacquired
Log abandoned
Log reacquired
         0 Backend_health - reload_2018-03-30T091328.phpipam.dockerkube-lb.dna(10.168.11.50) Still healthy 4--X-RH 5 5 5 0.103631 0.106407 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T091328.phpipam.dockerkube-lb.dna(10.168.11.56) Still healthy 4--X-RH 5 5 5 0.122540 0.116504 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T091328.phpipam.dockerkube-lb.dna(10.168.11.52) Still healthy 4--X-RH 5 5 5 0.112713 0.109685 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T091328.phpipam.dockerkube-lb.dna(10.168.11.51) Still healthy 4--X-RH 5 5 5 0.096416 0.109310 HTTP/1.1 200 OK
Log abandoned
Log reacquired
         0 Backend_health - reload_2018-03-30T093035.phpipam.dockerkube-lb.dna(10.168.11.50) Still healthy ------H 5 5 5 0.000000 0.000000
         0 Backend_health - reload_2018-03-30T093035.phpipam.dockerkube-lb.dna(10.168.11.50) Still healthy ------H 5 5 5 0.000000 0.000000
         0 Backend_health - reload_2018-03-30T093035.phpipam.dockerkube-lb.dna(10.168.11.51) Still healthy ------H 5 5 5 0.000000 0.000000
         0 Backend_health - reload_2018-03-30T093035.phpipam.dockerkube-lb.dna(10.168.11.51) Still healthy ------H 5 5 5 0.000000 0.000000
         0 Backend_health - reload_2018-03-30T093035.phpipam.dockerkube-lb.dna(10.168.11.52) Still healthy ------H 5 5 5 0.000000 0.000000
         0 Backend_health - reload_2018-03-30T093035.phpipam.dockerkube-lb.dna(10.168.11.52) Still healthy ------H 5 5 5 0.000000 0.000000
         0 Backend_health - reload_2018-03-30T093035.phpipam.dockerkube-lb.dna(10.168.11.56) Still healthy ------H 5 5 5 0.000000 0.000000
         0 Backend_health - reload_2018-03-30T093035.phpipam.dockerkube-lb.dna(10.168.11.56) Still healthy ------H 5 5 5 0.000000 0.000000
         0 Backend_health - reload_2018-03-30T093035.phpipam.dockerkube-lb.dna(10.168.11.52) Still healthy 4--X-RH 5 5 5 0.131276 0.032819 HTTP/1.1 200 OK
Log abandoned
Log reacquired
         0 Backend_health - reload_2018-03-30T092814.phpipam.dockerkube-lb.dna(10.168.11.51) Still healthy 4--X-RH 5 5 5 0.132275 0.109428 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T092814.phpipam.dockerkube-lb.dna(10.168.11.56) Still healthy 4--X-RH 5 5 5 0.100486 0.105486 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T092814.phpipam.dockerkube-lb.dna(10.168.11.50) Still healthy 4--X-RH 5 5 5 0.106332 0.107591 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T092814.phpipam.dockerkube-lb.dna(10.168.11.52) Still healthy 4--X-RH 5 5 5 0.126403 0.114277 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T091328.phpipam.dockerkube-lb.dna(10.168.11.50) Still healthy 4--X-RH 5 5 5 0.169330 0.122138 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T091328.phpipam.dockerkube-lb.dna(10.168.11.56) Still healthy 4--X-RH 5 5 5 0.164584 0.128524 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T091328.phpipam.dockerkube-lb.dna(10.168.11.52) Still healthy 4--X-RH 5 5 5 0.173511 0.125642 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T091328.phpipam.dockerkube-lb.dna(10.168.11.51) Still healthy 4--X-RH 5 5 5 0.169129 0.124265 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T093035.phpipam.dockerkube-lb.dna(10.168.11.51) Still healthy 4--X-RH 5 5 5 0.173798 0.066378 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T093035.phpipam.dockerkube-lb.dna(10.168.11.56) Still healthy 4--X-RH 5 5 5 0.180521 0.065793 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T093035.phpipam.dockerkube-lb.dna(10.168.11.52) Still healthy 4--X-RH 5 5 5 0.171341 0.067449 HTTP/1.1 200 OK
         0 Backend_health - reload_2018-03-30T093035.phpipam.dockerkube-lb.dna(10.168.11.50) Still healthy 4--X-RH 5 5 5 0.195845 0.073446 HTTP/1.1 200 OK

So the backend always seems fine, but the first call to

std.healthy(req.backend_hint)

returns false, and the 503 error page is displayed.

dridi commented 6 years ago

One thing that could cause the std.healthy to report a sick director is the lack of backends until the first lookup completes. According to your timestamps it's a 10ms window.

There is a parameter for that:

https://github.com/nigoroll/libvmod-dynamic/blob/8a71eb14cd4e5d5d1766995ffa7d288b41871bb8/src/vmod_dynamic.vcc#L216-L220
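
For illustration, assuming the parameter referenced above is the director's first_lookup_timeout attribute (check the linked documentation for the exact name and default), it could be set roughly like this on your director:

new phpipam = dynamic.director(
        port = "8184",
        probe = loginslash_probe,
        ttl = 5m,
        # assumed attribute name: wait up to this long for the initial DNS lookup
        first_lookup_timeout = 10s
);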

However, while I'm sure it applies to backend selection, maybe it should also be honored by the health code path.

Backend selection does it:

https://github.com/nigoroll/libvmod-dynamic/blob/8a71eb14cd4e5d5d1766995ffa7d288b41871bb8/src/vmod_dynamic.c#L118-L124

Nothing is done once the lock is acquired in the dynamic_healthy callback:

https://github.com/nigoroll/libvmod-dynamic/blob/8a71eb14cd4e5d5d1766995ffa7d288b41871bb8/src/vmod_dynamic.c#L167-L173

@nigoroll: sounds legit to me.

ghost commented 6 years ago

@Dridi @nigoroll So is there a plan to change the behaviour of dynamic_healthy?