Unexpected VCL subroutine flow

Vormillion commented 1 month ago

Expected Behavior

Waitinglist requests should not be considered as cached.

Current Behavior

After purge request to URL, next request is getting MISS and backend fetch as expected, but next requests are getting HIT VCL_call although they are in waitinglist.

Possible Solution

No response

Steps to Reproduce (for bugs)

No response

Context

We are trying to troubleshoot scenario, when varnishlog is showing that object was served from cache but in reality fetch time was quite long due to waitinglist.

We are setting special response header called X-Cache-Status:

sub vcl_deliver {
    if (obj.hits > 0) {
       set resp.http.X-UPS = "V";
       set resp.http.X-Cache-Status = "HIT";
       unset resp.http.X-UPS-RESTIME;
    } else {
       set resp.http.X-Cache-Status = "MISS";
       set resp.http.X-UPS = "M";
    }
}

Varnishlog output showing cache purge -> miss on first call -> hit on second call while request is in waitinglist so in reality it's not HIT, so my question is why VCL_HIT is called here.

Cache purge request

*   << Request  >> 862691    
-   Begin          req 862690 rxreq
-   Timestamp      Start: 1723626741.517823 0.000000 0.000000
-   Timestamp      Req: 1723626741.517823 0.000000 0.000000
-   VCL_use        varnish_1723557604
-   ReqStart       10.26.31.38 54314 a0
-   ReqMethod      PURGE
-   ReqURL         /
-   ReqProtocol    HTTP/1.1
-   ReqHeader      X-Magento-Tags-Pattern: ((^|,)cat_c_p_22(,|$))
-   ReqHeader      Host: XXX
-   ReqHeader      X-Forwarded-For: XXX
-   VCL_call       RECV
-   ReqHeader      disable_http_basic: 1
-   VCL_return     synth
-   VCL_call       HASH
-   VCL_return     lookup
-   RespProtocol   HTTP/1.1
-   RespStatus     200
-   RespReason     Purged
-   RespHeader     Date: Wed, 14 Aug 2024 09:12:21 GMT
-   RespHeader     Server: Varnish
-   RespHeader     X-Varnish: 862691
-   VCL_call       SYNTH
-   RespHeader     Content-Type: text/html; charset=utf-8
-   RespHeader     Retry-After: 5
-   VCL_return     deliver
-   Timestamp      Process: 1723626741.517868 0.000045 0.000045
-   RespHeader     Content-Length: 241
-   Storage        malloc Transient
-   Filters        
-   RespHeader     Connection: keep-alive
-   Timestamp      Resp: 1723626741.517892 0.000069 0.000024
-   ReqAcct        95 0 95 197 241 438
-   End

Fresh call to /napoje - this is MISS as expected

*   << Request  >> 2387316   
-   Begin          req 862673 rxreq
-   Timestamp      Start: 1723626735.944309 0.000000 0.000000
-   Timestamp      Req: 1723626735.944309 0.000000 0.000000
-   VCL_use        varnish_1723557604
-   ReqStart       10.26.19.1 34820 a0
-   ReqMethod      GET
-   ReqURL         /napoje
-   ReqProtocol    HTTP/1.1
-   ReqHeader      Host: XXX
-   ReqHeader      X-Request-ID: 5b2a2082bd338f875428ed54a37213d4
-   ReqHeader      X-Real-IP: XXX
-   ReqHeader      X-Forwarded-For: XXX
-   ReqHeader      X-Forwarded-Host: XXX
-   ReqHeader      X-Forwarded-Port: 443
-   ReqHeader      X-Forwarded-Proto: https
-   ReqHeader      X-Forwarded-Scheme: https
-   ReqHeader      X-Scheme: https
-   ReqHeader      X-Original-Forwarded-For: XXX
-   ReqHeader      Sec-CH-UA: "Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"
-   ReqHeader      sec-ch-ua-mobile: ?0
-   ReqHeader      sec-ch-ua-platform: "Linux"
-   ReqHeader      upgrade-insecure-requests: 1
-   ReqHeader      User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
-   ReqHeader      Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
-   ReqHeader      sec-fetch-site: same-origin
-   ReqHeader      sec-fetch-mode: navigate
-   ReqHeader      sec-fetch-user: ?1
-   ReqHeader      sec-fetch-dest: document
-   ReqHeader      Accept-Language: pl-PL,pl;q=0.9,en-US;q=0.8,en;q=0.7
-   ReqHeader      priority: u=0, i
-   ReqHeader      Accept-Encoding: gzip
-   ReqHeader      Cache-Control: no-cache, max-age=0
-   VCL_call       RECV
-   ReqUnset       X-Forwarded-For: XXX
-   ReqHeader      X-Forwarded-For: XXX
-   ReqUnset       X-Real-IP: XXX
-   ReqHeader      X-Real-Ip: XXX
-   ReqURL         /napoje
-   ReqUnset       Accept-Encoding: gzip
-   ReqHeader      Accept-Encoding: gzip
-   ReqURL         /napoje
-   VCL_return     hash
-   VCL_call       HASH
-   VCL_return     lookup
-   VCL_call       MISS
-   VCL_return     fetch
-   Link           bereq 2387317 fetch
-   Timestamp      Fetch: 1723626742.356807 6.412498 6.412498
-   RespProtocol   HTTP/1.1
-   RespStatus     200
-   RespReason     OK
-   RespHeader     Server: nginx
-   RespHeader     Date: Wed, 14 Aug 2024 09:12:22 GMT
-   RespHeader     Content-Type: text/html; charset=UTF-8
-   RespHeader     Edge-Control: bypass-cache
-   RespHeader     X-Content-Type-Options: nosniff
-   RespHeader     X-XSS-Protection: 1; mode=block
-   RespHeader     X-Frame-Options: SAMEORIGIN
-   RespHeader     Vary: Accept-Encoding
-   RespHeader     X-UPS-RESTIME: 6.354
-   RespHeader     X-UPS: M
-   RespHeader     Pragma: no-cache
-   RespHeader     Expires: -1
-   RespHeader     Cache-Control: no-store, no-cache, must-revalidate, max-age=0
-   RespHeader     backend: default
-   RespHeader     Content-Encoding: gzip
-   RespHeader     X-Varnish: 2387316
-   RespHeader     Age: 0
-   RespHeader     Accept-Ranges: bytes
-   VCL_call       DELIVER
-   RespUnset      Age: 0
-   RespHeader     X-Cache-Status: MISS
-   RespUnset      X-UPS: M
-   RespHeader     X-UPS: M
-   RespUnset      Server: nginx
-   RespUnset      X-Varnish: 2387316
-   VCL_return     deliver
-   Timestamp      Process: 1723626742.356844 6.412534 0.000036
-   Filters        
-   RespHeader     Content-Length: 114887
-   RespHeader     Connection: keep-alive
-   Timestamp      Resp: 1723626742.356931 6.412621 0.000087
-   ReqAcct        2416 0 2416 6231 114887 121118
-   End

Second call to /napoje (in the meantime previous request is still fetching page from backend ~6 seconds)

*   << Request  >> 1538196   
-   Begin          req 1538181 rxreq
-   Timestamp      Start: 1723626736.342492 0.000000 0.000000
-   Timestamp      Req: 1723626736.342492 0.000000 0.000000
-   VCL_use        varnish_1723557604
-   ReqStart       10.26.11.78 54734 a0
-   ReqMethod      GET
-   ReqURL         /napoje
-   ReqProtocol    HTTP/1.1
-   ReqHeader      Host: XXX
-   ReqHeader      X-Request-ID: cf20f0e2e9ef40211ae49485149abe50
-   ReqHeader      X-Real-IP: XXX
-   ReqHeader      X-Forwarded-For: XXX
-   ReqHeader      X-Forwarded-Host: XXX
-   ReqHeader      X-Forwarded-Port: 443
-   ReqHeader      X-Forwarded-Proto: https
-   ReqHeader      X-Forwarded-Scheme: https
-   ReqHeader      X-Scheme: https
-   ReqHeader      X-Original-Forwarded-For: XXX
-   ReqHeader      Sec-CH-UA: "Not)A;Brand";v="99", "Android WebView";v="127", "Chromium";v="127"
-   ReqHeader      sec-ch-ua-mobile: ?1
-   ReqHeader      sec-ch-ua-platform: "Android"
-   ReqHeader      upgrade-insecure-requests: 1
-   ReqHeader      User-Agent: Mozilla/5.0 (Linux; Android 11; SM-A202F Build/RP1A.200720.012; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/127.0.6533.97 Mobile Safari/537.36 [FB_IAB/FB4A;FBAV/476.0.0.49.74;]
-   ReqHeader      Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
-   ReqHeader      Accept-Language: cs-CZ,cs;q=0.9,en-US;q=0.8,en;q=0.7
-   ReqHeader      priority: u=0, i
-   VCL_call       RECV
-   ReqUnset       X-Forwarded-For: XXX
-   ReqHeader      X-Forwarded-For:XXX
-   ReqUnset       X-Real-IP: XXX
-   ReqHeader      X-Real-Ip: XXX
-   ReqURL         /napoje
-   ReqUnset       Accept-Encoding: gzip
-   ReqHeader      Accept-Encoding: gzip
-   ReqURL         /napoje
-   VCL_return     hash
-   VCL_call       HASH
-   VCL_return     lookup
-   Hit            2387317 86405.956800 60.000000 0.000000 114887
-   Timestamp      Waitinglist: 1723626742.356823 6.014330 6.014330
-   VCL_call       HIT
-   VCL_return     deliver
-   RespProtocol   HTTP/1.1
-   RespStatus     200
-   RespReason     OK
-   RespHeader     Server: nginx
-   RespHeader     Date: Wed, 14 Aug 2024 09:12:22 GMT
-   RespHeader     Content-Type: text/html; charset=UTF-8
-   RespHeader     Edge-Control: bypass-cache
-   RespHeader     X-Content-Type-Options: nosniff
-   RespHeader     X-XSS-Protection: 1; mode=block
-   RespHeader     X-Frame-Options: SAMEORIGIN
-   RespHeader     Vary: Accept-Encoding
-   RespHeader     X-UPS-RESTIME: 6.354
-   RespHeader     X-UPS: M
-   RespHeader     Pragma: no-cache
-   RespHeader     Expires: -1
-   RespHeader     Cache-Control: no-store, no-cache, must-revalidate, max-age=0
-   RespHeader     backend: default
-   RespHeader     Content-Encoding: gzip
-   RespHeader     X-Varnish: 1538196 2387317
-   RespHeader     Age: 0
-   RespHeader     Accept-Ranges: bytes
-   VCL_call       DELIVER
-   RespUnset      Age: 0
-   RespUnset      X-UPS: M
-   RespHeader     X-UPS: V
-   RespHeader     X-Cache-Status: HIT
-   RespUnset      X-UPS-RESTIME: 6.354
-   RespHeader     Referrer-Policy: no-referrer-when-downgrade
-   RespUnset      Server: nginx
-   RespUnset      X-Varnish: 1538196 2387317
-   VCL_return     deliver
-   Timestamp      Process: 1723626742.356853 6.014360 0.000030
-   Filters        
-   RespHeader     Content-Length: 114887
-   RespHeader     Connection: keep-alive
-   Timestamp      Resp: 1723626742.356943 6.014450 0.000089
-   ReqAcct        3634 0 3634 6208 114887 121095
-   End

Varnish Cache version

7.5.0

Operating system

Debian 12

Source of binary packages used (if any)

No response

gquintard commented 1 month ago

slightly relevant blog post: https://info.varnish-software.com/blog/using-obj-hits

it is a hit though: the object is found in the cache, albeit in a currently incomplete form. One could argue that vcl_hit could offer a obj.still_downloading boolean property, though

nigoroll commented 1 month ago

side note: the bereq (for example via varnishlog -g request) would have been helpful.

asadsa92 commented 1 month ago

Test case (written by @AlveElde , modified it slithly myself):

varnishtest "Cache HIT on shortlived error object with waitinglist"

server s1 {
    rxreq
    delay 10
} -start

varnish v1 -vcl+backend {
    sub vcl_backend_fetch {
        set bereq.first_byte_timeout = 1s;
    }

    sub vcl_deliver {
        set resp.http.hits = obj.hits;
        set resp.http.uncacheable = obj.uncacheable;
        set resp.http.x-cache = req.http.x-cache;
    }

    sub vcl_hit {
        set req.http.x-cache = "HIT";
    }

    sub vcl_miss {
        set req.http.x-cache = "MISS";
    }
} -start

client c1 {
    txreq
    rxresp
    expect resp.status == 503
    expect resp.http.hits == 0
    expect resp.http.uncacheable == false
    expect resp.http.x-cache == "MISS"
} -start

delay 0.5

client c2 {
    txreq
    rxresp
    expect resp.status == 503
    expect resp.http.hits == 1
    expect resp.http.uncacheable == false
    expect resp.http.x-cache == "HIT"
} -start

client c1 -wait
client c2 -wait

varnish v1 -expect MAIN.cache_hitmiss == 0
varnish v1 -expect MAIN.cache_miss == 1
varnish v1 -expect MAIN.cache_hit == 1
varnish v1 -expect MAIN.cache_hit_grace == 0
varnish v1 -expect MAIN.busy_sleep == 1

bsdphk commented 1 month ago

Bugwash: Things seem to work as they should, but we may be calling something a hit (VSC) which is not what users would think of as "hit" since delivery isn't immediate.

Return to this after thinking/looking at the statistics.

nigoroll commented 4 weeks ago

I would suggest the following:

rename MAIN.busy_sleep to MAIN.cache_hit_coalesce
add obj.was_coalescing analogous to obj.is_hitmiss etc.

but regarding the OPs main point, I think that a coalescing (aka waitinglist) hit will need to continue ending up in vcl_hit, because it is that, even if it has to wait.

dridi commented 6 days ago

I would need to dig up my 242-slides presentation from a couple VDDs ago but I'm pretty sure something like that was part of the plan laid out.

dridi commented 6 days ago

I couldn't find such a flag in my presentation.

I would suggest the following:

* rename `MAIN.busy_sleep` to `MAIN.cache_hit_coalesce`

* add `obj.was_coalescing` analogous to `obj.is_hitmiss` etc.

If we rename a busy counter, we should rename all of them to reflect their relation to the waiting list so I would prefer something like cache_hit_waitlist and other counters renamed to waitlist_something. See #4073 for the gory details, but "busy" was conflated between busyobj and waiting list and created a confusion in this area.

Regarding obj.was_coalescing I think this is taking the problem from the wrong end and I would prefer something like req.was_waiting to better reflect what went on.

nigoroll commented 5 days ago

I am also ok with @dridi's suggestion to consistently rename to waitlist instead of coalesce, if that was thought to be more descriptive. My personal opinion is that the waiting list is only the implementation, but the mechanism is coalescing, so I prefer the latter. Regarding the flag to reside under obj or req, I it is related: req.was_waiting somehow makes sense but sounds very general ("and for what was the request waiting?") while req.was_coalescing in my mind also takes the problem from the wrong end, because it was the object that was the result of coalescing backend requests. On a side note, for restarts I think we would need to clear a request flag to signal exactly during which iteration the waitinglist was entered, and that, again, seems un-pola to me, because then the request might report was_waiting = false when in fact it did before. So all in all, obj.was_coalescing actually seems to be more descriptive in my mind still, but I can go with whatever.

dridi commented 5 days ago

I think I'm actually against cache_hit_waitlist because <counter>_<specialization> is supposed to represent a subset of <counter> and a waiting list hit is just a hit, so it could very well be a grace hit, and it shouldn't be presented as a sibling of cache_hit_grace.

If we touch one of the busy_* counters, they should probably all use the same waitlist_ or waitinglist_ prefix. For now we should probably leave them alone to avoid breaking setups for no good reason (we don't have VSC aliases for compatibility).

Regarding the original problem of long hit transactions, the same can happen without involving a waiting list. A hit on an object in the BOS_STREAM state (ongoing fetch) can also appear to take a long time in varnishlog. Imagine a backend serving the response headers fast, but taking a long time to produce the response body, or simply a slow or long fetch in general.

Now regarding req.<new_waitinglist_flag_tbd>, something potentially more useful could be instead a new variable req.waiting_list_time telling you the amount of time spent in a waiting list. You could probably already craft this by storing now in vcl_recv and computing the difference in vcl_{hit,miss,pass}. Keeping track of a duration can also translate fairly easily to a boolean:

if (req.waiting_list_time > 0) {
        # ...
}

I'm aware that a DURATION called "something_time" can be confusing, I'm just not making an effort to find a name for the concept.

varnishcache / varnish-cache