Backend request coalescing fail / vcl_hit{} returns miss without busy object. Doing pass.

bsdphk commented 8 years ago

Old ticket imported from Trac See archived copy here: https://varnish-cache.org/trac/ticket/1799

nigoroll commented 8 years ago

Here's another workaround:

mark return(miss) cases in vcl_hit with a req header
in vcl_pass, check for this header and, if present, return(restart)
in vcl_miss, clear this header
in vcl_recv, check for restart and the presence of the marker header and and set hash_always_miss

sub vcl_hit {
    // ...
    set req.http.workaround-1799 = "fixup";
    return (miss);
}
sub vcl_pass {
    if (req.http.workaround-1799 && req.restarts < 2) {
        return (restart);
    }
}
sub vcl_miss {
    unset req.http.workaround-1799;
}
sub vcl_recv {
    if (req.restarts >= 1 && req.http.workaround-1799) {
        set req.hash_always_miss = true;
    } else {
        unset req.http.workaround-1799
    }
}

Note that for configurations with more than one backend, this will only work as such if either bereq.backend gets set on the backend side or if req.backend_hint gets restored before the vcl_recv snippet. See #2405

nigoroll commented 8 years ago

saving irc discussion with @mbgrydeland and @bsdphk for easy reference

(13:46:48) martin: was this alleviated by the hfp->hfm change? (13:47:01) slink: So I was thinking about a couple of options with this in mind and the hit-for-pass vs. hit-for-miss: with the current hit-for-miss, we insert new objects all the time, which might also turn out to be problematic for some cases. So the one most drastic question is: should we have a late cache insert after fetch? (13:47:56) slink: martin its just something i thought about in the new context (13:48:23) slink: and I dont think 1799 has got any better (13:48:43) slink: miss from hit is still a pass (13:48:51) martin: yeah, i think you're right. that one is still open (13:49:13) martin: a late cache insert would make the request coalescing fail wouldn't it? (13:49:37) slink: yes, but only for the case where you have an object and decide to miss anyway (13:51:29) slink: martin: hm but yes, I wonder of there is a better option... (13:52:04) martin: this is tricky stuff, really touches on several areas that need to be just right for things work out (13:52:15) slink: martin: absolutely (13:53:38) slink: But I think there are some basic questions: for coalescing we need a busy object. So we have the options of inserting one which we may not need or inserting it later and accept the risk that we may miss out for some time (13:53:56) slink: "always" inserting one was an option phk really had concerns about (13:54:09) slink: but basically hfm is not that far away from it (13:56:57) martin: slink: in your writeup of the problem in the email to -dev (13:57:05) martin: the vcl snippet (13:57:16) martin: if tests on obj.http.Criterium (13:57:24) martin: what would that typically be? (13:57:56) martin: eh (13:57:58) martin: ignore me (13:58:04) martin: i got it now (13:58:18) slink: martin: :) (13:58:25) martin: thats the test for whether we want hit-for-pass or hit-for-miss behaviour (13:58:30) slink: martin: yes (13:59:04) slink: for most vcls, it would probably just a plain if (obj.uncacheable) { return (miss) } (14:00:18) slink: how strong are your concerns regarding the race if entering lookup twice? ... (14:02:20) martin: we already do enter lookup multiple times (14:02:35) martin: every time a req is woken from WL it will reenter lookup (14:02:46) martin: taking the locks and reevaluate whats on the list (14:02:49) slink: yes (14:03:21) slink: just for the WL case that is (14:03:45) slink: I was thinking along your concerns in 1799 (14:03:58) slink: "I've been pondering how to fix this. But all attempts at reentering the waitinglist on the OH after HSH_Lookup has completed seems to open race conditions." (14:04:24) slink: re-introducing req.grace would appear to solve the case, but then we remove the decision from hit again (14:04:54) slink: and the hfm/hfp case lets me think that we should have more decision points in vcl, not less ... (14:05:41) slink: So my story goes: We could have a late insert after fetch, but for the case of miss from hit, we could also just have another lookup (14:06:05) slink: then we could get to hit again and we'd certainly need a recursion cutoff, but would that be so bad? (14:09:23) martin: i still don't fully get your thinkin (14:09:48) martin: the late insert - that's a late insertion of an objcore on the objhead right? (14:09:57) slink: yes (14:10:21) martin: so thats an objcore that was initially a miss (14:10:30) slink: pass (14:10:47) martin: a pass? so it belongs to a completely different objhead then (14:11:18) slink: Let me try to summarize: ... (14:13:30) slink: I think there are different options: If we never wanted to leave hit without an oc, we'd need to insert one all the time, which I am not sure about if it's a good idea. So I just asked myself why we wouldn't want to take an OC_F_PRIVATE and insert it later. (14:15:05) slink: this would allow to leave hit without an oc and still change our mind, which would also be a way to hit the hfm case as well as expplicit return(miss) from hit. It wouldn't guarantee to give coalescing, there would be a window when additional misses happen. (14:16:07) slink: so the other idea was that we could lookup again to get a busy oc after miss from hit, which also leaves a race window, but much smaller (14:16:40) slink: so my question is what your concerns are regarding these races. (14:20:27) martin: the first question that comes to mind is what to do when it appears the world has changed under our feet between the first and next lookup (14:20:47) martin: when the design is that vcl_hit should make the decision (14:20:59) martin: that would mean that we would need to reenter vcl_hit as well (14:21:27) slink: exactly (14:21:28) martin: the lookup might just as well be a proper cache hit at this point and no fetch is needed after all (14:21:34) slink: yes (14:22:03) martin: i believe this was an option back in the day when we were discussing 1799 (14:22:17) martin: for some reason it was discarded though (14:22:30) martin: and i'm failing at remembering exactly why that idea was discarded (14:22:35) slink: IIUC phk he was unhappy about the enter-vcl-hit-multuple times (14:22:52) martin: could be (14:22:58) martin: it does feel a bit hackish (14:23:27) martin: later the idea to actually hold the objhead mutex across the VCL call was floated (14:23:42) martin: giving the user still full control (14:23:51) martin: but eliminating the races (14:24:10) martin: obvious downside to this is when vcl_hit contains curl vmod calls (14:24:26) martin: and the extending of critical region that would give (14:24:37) slink: yes, but OC_F_DONTUSE sounds reasonable (14:25:02) slink: for the case you mentioned and I think it could also cover the hfp/hfm case (14:25:31) slink: so do you think that extending the critical region is better than the race? (14:25:53) martin: OC_F_DONTUSE has also mostly been discarded as an option (14:26:02) slink: ah ok (14:26:17) martin: because the vcl code then doesn't keep its descisions local to how to tread this request (14:26:35) martin: but will actually influence the cacheability of the object (14:26:55) martin: and the confusion that could give (14:27:26) martin: the whole of 1799 is really a case of 'damn it, all options sucks' (14:27:33) slink: um (14:28:24) slink: I had a bit of hope that we could get to a point that the race may not be that bad after all, but it sounds more like you'd really think we have to solve this differently. ... (14:29:55) martin: slink: i'm really uncertain how to fix it. if i were in charge, i'd go back to req.grace being an attribute to set beforehand, and the whole selection process returning back to C space where we would resolve all races under the mutex. (14:30:07) martin: but that would make vcl_hit pretty pointless (14:30:15) martin: so its far from an ideal solution (14:31:26) slink: martin: i absolutely understand this, but it not only would make vcl_hit pretty pointless, but have no perspective on hfm vs. hfp and the i-want-to-miss-anyway case which is important for some scenarios I work on (or rather: require a restart, as we do now) (14:31:56) martin: yeah (14:32:30) martin: so this is why the running vcl_hit under objhead->mtx actually becomes an attractive option (14:33:07) martin: with the caveat that doing anything but simple tests on object attributes while in vcl_hit is a bad idea ... (14:34:11) slink: and regarding hfm we either need to accept many inserts or implement some other cutoff just for this case ... (14:41:25) martin: for your 'check and decide miss or pass in vcl_hit' approach, doing the vcl_hit under mutex would create some headaches (14:41:42) martin: eh - or would it? ... (14:42:48) martin: we would then be holding the mtx of the objhead that belongs to the cache point (14:43:15) slink: martin can you help me understand the issue you are seeing? (14:43:49) martin: in 1799? (14:45:02) martin: i'm not really seeing an issue - i thought this exercise was to see the compatibility of the two cases in terms of proposed solutions (14:45:23) slink: yes (14:45:28) slink: that was my intention (14:45:39) slink: we should have a solution for both cases (14:46:10) slink: I still try to understand "for your 'check and decide miss or pass in vcl_hit' approach, doing the vcl_hit under mutex would create some headaches" (14:46:27) phk: back (14:47:21) martin: slink: i think i'll withdraw that statement (14:47:37) slink: martin: ok, what a relief :) (14:47:47) martin: hehe (14:48:05) phk: so, just to jump in here. (14:48:24) phk: I think I've come to the conclusion that calling vcl_hit from HSH_Lookup() is a no-go. (14:48:40) phk: but we may want to consider introducing a new vcl_candidate{} for that purpose. (14:48:58) slink: phk ah yes that was another alternative.... (14:49:15) phk: you'd better think thrice before doing something stupid there, and don't come to me whining about performance if you use it. (14:49:20) slink: But I don't see how that solves the miss from hit case (14:49:50) slink: oh wait (14:49:57) martin: i notice that vcl_candidate is not in plural (14:50:03) martin: so it'll be caleed numerous times then? (14:50:09) martin: once for each of the candidates? (14:50:41) phk: martin, yes, I don't think there is any alternative to that. (14:50:49) slink: we have no miss from hit then, right? So how would we address the new case we brought up Friday? (14:50:56) phk: mark, I really don't want VCL wandering the objhdr list on its own (14:51:26) martin: i can see that (14:51:43) martin: but what sort of return values would vcl_candidate have? (14:51:51) martin: {MISS,PASS,HIT}? (14:52:07) phk: martin, something like that. (14:52:27) slink: hm (14:53:38) martin: {MISS,PASS,HIT,PURGE}? (14:54:20) phk: purge only that obj or all on that objhdr ? (14:54:34) martin: only that obj i'd say (14:54:40) phk: and would purge still allow it to be IMS candidate ? (14:55:07) martin: no (14:55:41) martin: i'm not certain purge is important here (14:55:43) phk: (actually, that is probably something we should look at in general...) (14:56:03) martin: would be an awkward way to implement invalidation (14:56:49) slink: may I bother you with an idea? (14:56:49) slink: - document vcl_hit as (14:56:50) slink: - must be idempotent (14:56:50) slink: - must be as efficient as possible (14:56:50) slink: - oh gets a change number (14:56:50) slink: - incremented with every insert (14:56:50) slink: - first first lookup (14:56:50) slink: - get mtx (14:56:50) slink: - lookup (14:56:50) slink: - release mtx (14:56:50) slink: - call miss/hit (14:56:50) slink: - if hit comes back with miss: (14:56:50) slink: - second lookup (14:56:50) slink: - get mtx (14:56:50) slink: - check change number (14:56:50) slink: - if changed, call hit again with the lock held (14:57:36) slink: this way, for the base case we'd still be as efficient as we're now, and only for the hit->miss case we'd take the overhead for calling hit under the mtx (14:58:14) slink: also if hit was inefficient, we'd not be hit too hard (14:58:19) phk: slink, that algo has no graceful degradation, it eats more and more resources under heavy load (14:59:38) slink: phk it has an upper bound, doesn't it? (15:00:15) phk: it does, but I still don't like it. (15:00:34) phk: and adding a new vcl_foo{} is cheaper than untangling all the worlds vcl_hit{}'s (15:01:48) slink: ok, but actaully vcl_candidate{} is the same as vcl_hit but called under the mtx, right? (15:02:05) slink: then vcl_hit is obsolete, isn't it? (15:02:31) phk: slink, it may be, but it means peoples VCL will still work (15:02:38) slink: at least regarding the do-we-want-this-object decision (15:02:49) martin: if we have the vcl_candidate, that means that HSH_Lookup will have a list of OC candidates, and a user supplied decision for each of them (15:02:59) slink: peoples vcl with miss from hit still won't work, or will it? (15:03:04) martin: what rules would HSHLookup then apply in terms of selecting one of them? (15:03:16) dridi: phk: not if they have ttl/grace logic that happens before their own in vcl_hit (15:03:20) phk: martin, havn't thought that much about it yet (15:03:35) phk: martin, in reality, we could shift some of HSH_Lookup out to VCL if we did so (15:03:42) slink: phk but it's that kind of thinking we're into at the moment (15:03:46) phk: that would really allow people to shoot their feet of (15:04:18) martin: maybe GRACE should also be part of the decision attributes (15:04:52) slink: I am not sure if a vcl_candidate{} looking at just one object at a time can make any useful decision (15:04:53) ***phk has high priority interrupt: mom just dropped in (15:08:27) slink: phk ok i really fail to see how this approach would help. I'd need to see some vcl mock showing how it would look like for the 1799 case and the hfp vs hfm (15:08:36) slink: please

nigoroll commented 8 years ago

To summarize my understanding of the options we have discussed so far:

call vcl_hit while holding the oh->mtx in order to atomically insert a busy object if we need to
- likely to have negative performance impact
- dreadful if vcl_hit contains any long running calls (curl)
variation: only do this if opportunistic approach fails (irc transscript 14:56:49)
- @bsdphk doesn't like it
2nd round to lookup->hit (see bellow)
introduce vcl_candidate{} to be called from lookup under oh->mtx
- unclear at least to me how this would solve the issue

At this point I'd favor this approach:

officially require vcl_hit to be idempotent
after a miss from vcl_hit
- re-enter lookup
- either forcing a busy object insert
- or put req onto waitinglist if one exists
- if returned with a busy object, call vcl_hit again. Should it return miss again, we have a busy object to continue with, otherwise we drop it and continue with the hit.

michbsd commented 7 years ago

Is this bug still valid for 5.x ?

nigoroll commented 7 years ago

@michbsd yes

knan-linpro commented 6 years ago

PR#2519 helps, but doesn't totally cure the problem in our 4.1.9 overloaded-40Gbit testing (we probably hit the late expiry thread case, no keep configured), we still get the fetch without busy object VCL_Error log message and some request storms - but fewer than before.

PR#2519 combined with obj.ttl+obj.grace+1s > 0 return(deliver) in vcl_hit() the backend request storms are alleviated to the point they're no longer a problem for us.

hermunn commented 6 years ago

Just a small comment, @knan-linpro was testing https://github.com/hermunn/varnish-cache/commit/e3463e106e4a030673590f815f10e9f445436ac7, which is a patch on 4.1.9 that is very similar to #2519.

The fact that the patch was enough is interesting for both this ticket and #2519. We must be taking another timestamp between HSH_Lookup and vcl_hit so that the quiestion "Are we past grace?" does not get the same answer in the two places. This clearly demonstrates that #2519 is not enough (but surely helpful), and that we must take a deeper look at the time calculations.

I will continue looking into this and update either #2519 or this ticket when I find something.

hermunn commented 6 years ago

Unfortunately I agree with @nigoroll that we cannot close this, even with #2519 pushed. It helps, but I cannot find a good VCL workaround that works perfectly for the main use case - long grace for sick backends and short grace for healthy backends.

I have written about this in 1675a23 but not created a pull request. The pull request will be created when I have had a good sleep, pondered a bit, and read through everything again. Maybe I have to write a few test cases to verify that the text is correct. Feel free to comment anyway.

But let's talk about opportunities! Right now there are some possible solutions that come to mind, each with different downsides:

Reintroduce req.grace, and put the if (std.healthy(...)) part in sub vcl_recv. This takes away much of the purpose of vcl_hit and maybe some flexibility, but I can live with that. The big upside is that it is easy to explain to users - set req.grace if you want to override the grace. The lowest value of req.grace and obj.grace is always used, so that it is safe to set it low (but not zero) when we know that the backend is healthy.
Introduce a new variable, lets call it do_bgfetch in sub vcl_hit that will be true if the thread created the busy object and prepared the fetch, and false if there already was a busy object there. Then, setting do_bgfetch to false with subsequent return(deliver) (still in sun vcl_hit) will cancel the backend fetch and emancipate a request on the waiting list, which will be able to insert a new busy object on the waiting list and maybe do the same dance. Then one can set beresp.keep to a high value and explicitly set the variable to false and return(deliver) whenever the backend is sick. The downside to this is some serialization in Varnish (but only internally, for one specific resource, and I can live with that) and that it is somewhat hard to explain. The upside is that we do not remove flexibility from VCL and get a good solution for our use case.
Same as above, but with a new function std.forbid_bgfetch() instread of the variable. The same upside applies, but the VCL does not get to know if the request started a bgfetch or not.

nigoroll commented 6 years ago

vcl_candidate concept https://etherpad.wikimedia.org/p/varnish_vcl_candidate as of now:

vcl_recv
~~~~~~~~

* EOL:
    - hash_ignore_busy
    - hash_always_miss

vcl-built-in-subs.rst:

vcl_hit
~~~~~~~

  ``miss``
    *removed*

  ``deliver``
    use this object - decision on bgfetch is in vcl_candidate

vcl_candidate
~~~~~~~~~~~~~

Available objects:

    req.* (rw (?))
    obj.* (ro)
    obj.busy boolean

        if true, all oher obj.* attributes will be set to the empty string

        XXX fail the vcl ?

    obj.hitmiss
    obj.hitpass

    ## When we get an obj with busy set to true, do we have anything else in the object? I think "no", so then this is a trap that one can fall into. A different question is "Is there some busy object" on this OH (that matches Vary etc), but that will force a second run through the list.

Called during cache lookup on all objects satisfying the Vary criteria, including
objects which are currently being fetched (these are marked by obj.busy).

As this subroutine is called while holding a per hash value lock, it should
be implemented as efficiently as possible.

## XXX reconsider reader/writer locks? Problem is that there is no safe upgrade
## (which we would need)
## idea: make opportunistically lockless maybe with membars? - but that 

The `vcl_candidate` subroutine may terminate with calling ``return()`` with one
of the following keywords:

  ``fail``
    see  `fail`_

  ``hit``
    Enter :ref:`vcl_hit` with this object

## Hypothesis: A user assumes that the first object will be a hit, a busy object or there is no clean hit. For this reason he assumes that there will be no real hits, but that is wrong. We do not have a guarantee that the objects are sorted. Do we need hit in vcl_candidate when we have ignore? (If we do, then we need to doc that the order is not guaranteed because of possible persisted objects that appear out of nowhere).

## sort order is from inserts. Users who want a guarantee can just return (candidate(obj.ttl))

  ``miss`` ## I agree that we need it for hitmiss
    Enter :ref:`vcl_miss` with a new object immediately

  ``pass`` ## I agree that we need it for hitpass
    Enter :ref:`vcl_pass` with a new object immediately

  ``candidate(metric)``
    Score this object with _metric_ and continue. Only the candidate with
    the highest metric will be remembered and be used to enter ref:`vcl_hit`
    unless for some other candidate ``hit`` or ``miss`` is returned.

  ``ignore``
    Ignore this object

If, after calling vcl_candidate on all candidate objects, there was no explicit return
other than ``candidate(metric)`` or ``ignore``, the object with the highest candidate(metric)
is used to call vcl_hit. If there was no candidate object, vcl_miss is called (after
potentially calling vcl_miss for a refresh, see below).

In addition to the return values, a background refresh can be requested at
any time, which gets cleared by a return(miss) and return(pass)::

    set obj.refresh = true;

If obj.refresh is set, a call to vcl_miss will happen to prepare the request
with is_bgfetch = true (XXX or new sub parallel to vcl_miss?)

Examples:

* varnish 3-style req.grace replacement

sub vcl_candidate {
    if (obj.ttl > 0s) {
        return (hit);
    }
    set obj.refresh = true;
    # XXX would really want a native variable instead of req.http.max-grace
    if (obj.age > 0 - std.integer(req.http.max-grace, obj.grace)) {
        return (candidate(0 - obj.age));
    if (obj.busy) {

    return (candidate(0 - std.integer(req.http.max-grace)));

    }
}

* get fresh content for bgfetches from another varnish

Toy example for simplicity, the ttl calculation for real implementations
will be more complex

sub vcl_candidate {
    if (obj.ttl < 1m && req.http.Is-Cluster-Bgfetch) {
        ## I feel that this should be return (ignore)
        ## no. another varnish issued a bgfetch, so on this one, we
        ## want a sync fetch
        return (miss);
    }
}

* always miss

sub vcl_candidate {
    # ... = same criterion as would enable always_miss in vcl_recv
    if (...) {
        return (miss);
    }
}

* ignore_busy

sub vcl_candidate {
    if (obj.busy && ...) {
        return (ignore);
    }
}

builtin.vcl:

sub vcl_candidate {
    if (obj.busy) {
        return (candidate(0));
    }

    if (obj.ttl >= 0s) {
        if (obj.hitmiss) {
            return (miss);
        }
        if (obj.hitpass) {
            return (pass);
        }
        return (hit);
    }

    if (obj.ttl + obj.grace > 0s) {
        set obj.refresh = true;
    }

    # grace objects have metric > 0, so they
    # are preferred over busy objects
    # keep objects have metric < 0, so busy wins
    return (candidate(obj.ttl + obj.grace));
}

# we enter here with
# - the first object for which vcl_candidate returned hit
# - or the object with the highest candidate metric
# - but never for busy objects
#
# XXX do we want obj.metric from vcl_candidate?

sub vcl_hit {
    # or (obj.metric > 0) ?
    if (obj.ttl + obj.grace > 0s) {
        return (deliver);
    }
    # foreground fetch for "object in keep"
    return (miss);
}

nigoroll commented 6 years ago

Outcome of last VDD's discussion: We do neither want a VCL nor an official vmod interface implementation of callbacks during HSH_Lookup. So we decided to

Bring back req.grace and change vcl_hit to just return(deliver) -> @hermunn
Add as an unofficial interface a callback for varnish wizards who won't complain @nigoroll

nigoroll commented 6 years ago

what's left TODO:

Add as an unofficial interface a callback for varnish wizards who won't complain (aka catflap) @nigoroll
remove return(miss) from vcl_hit

nigoroll commented 5 years ago

I had a quick look again regarding the catflap, current plan is:

implement #2743 first
resurrect useful bits from #2755
implement the catflap

I will be away Dec 11 - Jan 11 and will not attempt this before I return

nigoroll commented 5 years ago

So during today's bugwash I promised that I would be working on this, even tough I didn't plan to - and I have. But so far I have only finished the first two items, and I will not feel bad about it, but rather continue with a brain working at full capacity tomorrow.

nigoroll commented 5 years ago

FTR, I am working on the catflap but need something along the lines of #2855 / #2856 as a prerequisite. These are waiting for review

nigoroll commented 5 years ago

Waiting for #2858

bsdphk commented 5 years ago

Finally!

varnishcache / varnish-cache

Backend request coalescing fail / vcl_hit{} returns miss without busy object. Doing pass. #1799