varnishcache / varnish-cache

Varnish Cache source code repository
https://www.varnish-cache.org

hash: A waiting list match is a formal revalidation #4032

Closed: dridi closed this pull request 4 months ago

dridi commented 6 months ago

From the main commit message:

Not only does the waiting list operate on an objcore, it also passes a reference to requests before they reembark a worker. This way when an objcore is already present during lookup we can attempt a cache hit without holding the objhead lock.

Instead of repurposing the req::hash_objhead field into an equivalent req::hash_objcore, the field is actually removed. In order to signal that a request comes back from its waiting list, the lifetime of the req::waitinglist flag is extended until cnt_lookup() is reentered.

This change creates a new first step in the lookup process:

  • rush match (when coming from the waiting list)
  • hash lookup (unless coming from the waiting list)
  • objhead lookup (unless there was a rush hit)

If the rush match is a hit, the objhead lock is briefly acquired to release req's reference, so that the request relies solely on the objcore's reference like a normal hit.
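
As a rough illustration of this new first step, here is a minimal C sketch with simplified stand-in types and a hypothetical vary_match() helper; it is not the actual cache_hash.c code:

```c
/*
 * Minimal sketch of the rush-match step, not the actual cache_hash.c
 * code: the struct fields and vary_match() are simplified stand-ins.
 */
#include <stddef.h>
#include <string.h>

struct objcore {
	const char	*vary;		/* recorded Vary criteria, NULL if none */
};

struct req {
	unsigned	waitinglist;	/* still set when cnt_lookup() is reentered */
	struct objcore	*objcore;	/* reference handed over at rush time */
	const char	*vary_key;	/* request side of the Vary comparison */
};

static int
vary_match(const struct objcore *oc, const struct req *req)
{

	if (oc->vary == NULL)
		return (1);
	return (req->vary_key != NULL && !strcmp(oc->vary, req->vary_key));
}

/* Step 1: rush match, only attempted when coming back from a waiting list. */
static struct objcore *
rush_match(struct req *req)
{
	struct objcore *oc = req->objcore;

	if (!req->waitinglist || oc == NULL)
		return (NULL);		/* regular hash + objhead lookup */
	if (!vary_match(oc, req))
		return (NULL);		/* incompatible variant, look up again */
	/*
	 * Hit: the match itself happened without the objhead lock; the
	 * real code only grabs it briefly to settle the references.
	 */
	return (oc);
}
```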

This shifts the infamous waiting list serialization phenomenon to the vary header match. Knowing that a rushed object is guaranteed to lead to a cache hit allows the rush policy to be applied wholesale, instead of exponentially. Only requests incompatible with the objcore vary header may reenter the waiting list, a scenario no different from the spurious rush wakeups when operating solely on objhead updates.
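
The rush policy decision could be pictured like this; the flag names and values below are illustrative stand-ins rather than the real objcore flags:

```c
/*
 * Sketch of the rush-policy decision described above; the flags and
 * their values are illustrative, not the real objcore definitions.
 */
#define OC_F_BUSY	(1U << 0)	/* caching outcome not known yet */
#define OC_F_FAILED	(1U << 1)	/* fetch failed */

enum rush_policy {
	RUSH_EXPONENTIAL,	/* wake a few waiters, they rush the rest */
	RUSH_WHOLESALE		/* wake the whole waiting list at once */
};

static enum rush_policy
rush_policy(unsigned oc_flags)
{

	/*
	 * A usable, cacheable objcore guarantees a hit for every waiter
	 * (modulo Vary), so the entire list can be woken in one go.
	 */
	if ((oc_flags & (OC_F_BUSY | OC_F_FAILED)) == 0)
		return (RUSH_WHOLESALE);
	return (RUSH_EXPONENTIAL);
}
```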

If a cacheable object was inserted in the cache but has already expired, this behavior enables cache hits. This can be common in multi-tier Varnish setups where one Varnish server may serve a graced object to another, but it holds for any origin server that may serve stale yet valid responses.

The waiting list enables a proper response-wide no-cache behavior from now on, but the built-in VCL prevents it by default. This is also the first step towards implementing no-cache and private support at the header field granularity.

This is a mixed bag of cherry-picks from #3992 and new original commits. See individual commit messages for more details.

dridi commented 6 months ago

Rebased against master and force-pushed to solve a merge conflict introduced by 083505193a5065fab0359568816b5e733b69edc3.

dridi commented 6 months ago

Rebased against master and force-pushed to solve a merge conflict introduced by 7aeca4d63.

dridi commented 6 months ago
Discussion on IRC:

15:26 < dridi> slink: morning
15:26 < slink> bonjour dridi
15:26 < dridi> would you like to discuss waiting list? if you are not available don't worry, there's no rush
15:27 < slink> 1sec
15:27 * dridi counts to one
15:29 < slink> from "great evils of humanity" we present to you: the symbolic second
15:29 < slink> https://github.com/varnishcache/varnish-cache/pull/4032#discussion_r1440660399
15:30 * dridi finished counting to one
15:30 < slink> You might be right, that the back and forth was mostly due to the review, but in general, I am not so sure if the split is helpful. It might be. This is not paraphrasing "I think you are wrong", I am really not sure
15:31 < dridi> but at the same time i agree with you that if better, it's not all neat and tidy
15:32 < dridi> miss vs expired miss for example...
15:33 < dridi> do you need time to catch up with my other replies?
15:33 < slink> I was wondering if having a function boundary at an interface like https://github.com/varnishcache/varnish-cache/pull/4032#discussion_r1440664588 made sense, where the split-out function would just return some objects and leave the decision logic in HSH_Lookup
15:34 < dridi> right
15:34 < slink> But my main concern really is that, with the abstraction we should be totally clear about when we take a reference and when not.
15:34 < dridi> so you want to return all of oc, exp_oc and busy_oc, right?
15:35 < dridi> it probably deserves to be tested
15:35 < slink> I have not tried it, so I do not know how the code looks, I just have the feeling that it might be cleaner
15:35 < dridi> oh btw a minute ago i found one more thing to squash, i'll push it now
15:35 < slink> so my main irritation is regarding OC_F_BUSY.
15:36 < dridi> done
15:36 < slink> My understanding is that we currently only clear this flag by calling HSH_Unbusy(), and I do not see why we should change that
15:37 < slink> How is the busy flag relevant for the waitinglist optimization?
15:37 < dridi> sure, i can take a stab at it now
15:37 < dridi> so, when i started looking at the waiting list a year ago, i noticed that we trigger rushes from many places
15:38 < dridi> directly, or indirectly like when we free an objcore when the last ref is gone
15:39 < slink> yes. This goes back to when I looked at this, I had a pr/patch which would reduce the rushes to what I thought was a correct minimum and it was feared by others that we would miss rushes, leading to what surfaces as hangs/lockups
15:39 < dridi> so my reasoning was that we were lacking a step to clearly signal objcore usability or lack thereof
15:39 < slink> So I agree on the premise that we have too many rushes.
15:39 < dridi> now, HSH_Unbusy() was the step to clearly signal objcore usability by other tasks
15:39 < dridi> but it was only the happy path
15:40 < dridi> my conclusion was that we needed something for the failing path, embodied by HSH_Fail()
15:40 < dridi> now HSH_Fail() is interesting because it may happen after HSH_Unbusy() was called because of streaming
15:41 < dridi> at this point i thought i was done with ensure this step was always reached (triggering a rush)
15:41 < dridi> the test suite failed, that's when i realized we could never fetch a miss, that's the withdraw case
15:42 < dridi> at the same time i realized that a purge transition was acting like a withdrawn miss
15:42 < dridi> hence the introduction of HSH_Withdraw()
15:42 < slink> hmmm
15:42 < dridi> it was supposed to exist regardless to bring back the rush-just-one case
15:43 < dridi> and with that behind, all that was left failing in my test suite were synth responses and request bodies
15:43 < dridi> so to me, the BUSY flag meant that we were waiting for a caching outcome when parked on the waiting list
15:44 < dridi> since we'd never park for either a synth resp or a request body, they don't need the flag
15:45 < slink> My mental model has always been that Unbusy signals when the headers are complete such that we can reason on vary and potentially other criteria (catflap).
15:45 < dridi> yes, for the streaming case
15:45 < slink> also for non-streaming case, IMHO.
15:46 < slink> When we also have the body, we can still reason on the headers.
15:46 < dridi> yes, but you get notified when everything is ready, not just the headers
15:46 < dridi> ok, we agree
15:46 < slink> Now I am not sure if I devoured your patch enough, but if there are some OCs which basically never un-busy because they basically never busy, can
15:46 < slink> t we rush right away?
15:47 < dridi> no, because if they are never looked up with a busy flag (synth and vrb) there's nothing on their waiting lists
15:47 < dridi> private_oh's waiting list should always be empty anyway
15:48 < dridi> we should maybe add assert(oh != private_oh) in hsh_rush
15:48 < slink> we should to that straight away
15:48 < dridi> i will push a commit to the branch
15:48 < slink> So sure, vrb can never wake up anything, and for synth, we should rugh before the synth, not after
15:49 < slink> but I somehow don't understand why we need to change the BUSY semantics to do that
15:49 < dridi> yes, that's what HSH_Withdraw() does
15:50 < slink> I fear we'll be adding special casing
15:50 < dridi> if a miss doesn't fetch because it goes to synth, it rushes potential waiters
15:50 < dridi> special casing is already the norm
15:51 < slink> but it needs to, because another request now needs to fetch, right?
15:51 < dridi> whenever you deref an objcore you have to figure out what to rush between 0, 1, policy, nuggets, etc
15:51 < slink> the idea was that it should rush 1 only
15:51 < slink> the idea was that at the call site of the deref we know
15:52 < slink> but as I said, I am aware that the calls are not optimal
15:52 < dridi> knowledge at the call site is what makes the waiting list unwieldy today
15:52 < slink> my naive idea would be to just fix those call sites, add your vary check to lookup and be done?
15:53 < slink> I am not convinced that fixing the call sites is not sufficient.
15:54 < slink> Again, I KNOW that at the time my PR was "softened" out of fear that we could miss to rush where we need to
15:55 < dridi> having an exponential rush for cacheable objects is what forces us to keep rushes from everywhere
15:55 < slink> and I think your attempt is the first after like a decade to approach this topic again
15:55 -!- slink [slink@p3e9d4f47.dip0.t-ipconnect.de] has quit [Connection closed]
15:55 -!- slink [slink@p3e9d4f47.dip0.t-ipconnect.de] has joined #varnish-hacking
15:55 -!- slink1 [slink@p3e9d4f47.dip0.t-ipconnect.de] has joined #varnish-hacking
15:55 < dridi> what's the last reply you saw?
15:56 < slink1> flaky wifi.
15:56 < slink1> (15:52:21) dridi: knowledge at the call site is what makes the waiting list unwieldy today
15:56 < slink1> my naive idea would be to just fix those call sites, add your vary check to lookup and be done?
15:56 < slink1> I am not convinced that fixing the call sites is not sufficient.
15:56 < slink1> Again, I KNOW that at the time my PR was "softened" out of fear that we could miss to rush where we need to
15:56 < slink1> and I think your attempt is the first after like a decade to approach this topic again
15:56 < dridi> 15:55 < dridi> having an exponential rush for cacheable objects is what forces us to keep rushes from everywhere
15:57 < slink1> ah right, that is a good point indeed.
15:57 < dridi> so yes, there is this "unbusy" thing i mentioned earlier
15:58 < dridi> the mandatory step where unbusy is the happy path, fail is the sad path, and withdraw is the lack of fetch
15:58 < dridi> and in addition, the objcore conveys the rush outcome
15:58 < dridi> exponential or wholesale
15:59 < slink1> hmmm
15:59 < dridi> wholesale allows us to be done with it at the mandatory unbusy step
15:59 < slink1> wholesale could still wake up too many waiters which then race
15:59 < dridi> fail triggers an immediate exponential rush, whereas before it was delayed until the deref
16:00 < dridi> waiters don't race, they come back with an oc ref
16:00 < slink1> "race" to get the right object
16:00 < dridi> no
16:00 < dridi> that race is actually gone with this patch series
16:00 < dridi> you give the req the oc ref at rush time
16:01 < slink1> but that's still a bet right?
16:01 < dridi> so when they wake up and execute again, they already have everything they need
16:01 < slink1> it could still turn out to not match vary
16:01 < dridi> yes
16:01 < slink1> that's what i mean
16:02 < dridi> vary is the only reason to still see spurious wake ups and even waiting list serialization
16:02 < slink1> and the more waiting reqs we wake up, the more they will only sommon again on the next waiting list
16:02 < slink1> at least that is my understanding
16:02 < dridi> you bring an interesting point
16:02 < dridi> should we move the rush match at rush time?
16:03 < dridi> rush all compatible reqs, and up to rush_exponent incompatible variants?
16:03 < dridi> i'm not sure we can do that from the worker triggering the rush
16:04 < dridi> it also becomes an O(n) match for that worker vs O(1) for individual reqs
16:04 < slink1> oh and btw, I think the catflap needs to be considered in hsh_rush_match()
16:05 < dridi> oh, good catch
16:05 -!- slink [slink@p3e9d4f47.dip0.t-ipconnect.de] has quit [Ping timeout: 301 seconds]
16:05 < dridi> that one's going to be a bummer though, we need to lock the oh, don't we?
16:06 < slink1> yes
16:06 < dridi> i'm tempted to fail the match if there is a vcf
16:07 < slink1> At this point I feel like I need to think about this more. I am sure that we can optimize the rushes, and I absolutely agree with your analysis that there are too many
16:07 < dridi> not just too many rushes, also not enough wake ups!
16:07 < dridi> the test case is verbatim the same as the vtc of the previous pull request
16:07 < slink1> In an ideal world, what I would like to do is change the existing code (fix DerefObjCore, add earlier rushes where we can) and _then_ understand what we can not do without more involved changes like changing the OC_F_BUSY semantics etc\
16:08 < dridi> it enables more waiting list hits
16:08 < slink1> You might be right
16:09 < slink1> I am just not capable of seeing if you are.
16:09 < dridi> i appreciate the time you already spent on it
16:09 < slink1> it is interesing and I do value the time and effort you put into it

nigoroll commented 6 months ago

I looked at this again, and overall, I would like to ask for the following:

For one, I would prefer to split refactoring and semantic changes: I am not convinced of the HSH_Lookup() change as proposed, and IIUC it has nothing to do with the rest of the PR.

IIUC, the other changes can be summarized as:

  • Retire hash_objhead and just use req->objcore for the waitinglist
  • Optimize HSH_Lookup with hsh_rush_match()
  • Rush earlier when we transition to a private object

I do not see how we need to refactor so much to achieve these goals, and I would prefer if we first addressed these based on the existing interfaces (with some adjustments, maybe).

In particular, the OC_F_BUSY change looks dubious to me.

But most importantly, there is something about an argument from 0a87e86753c8e725dff545bd91f7c0c7b7bad76f which I do not understand:

Knowing that a rushed object is guaranteed to lead to a cache hit allows the rush policy to be applied wholesale

I really do not understand how this patch series achieves this, I don't even understand how it can be achieved at all for the general (Vary) case. Yes, this is true if the busy object has no Vary (we do not know if the busy object has a Vary before it gets unbusied), but in general, do I not understand or is it just not true?

dridi commented 5 months ago

IIUC, the other changes can be summarized as:

  • Retire hash_objhead and just use req->objcore for the waitinglist
  • Optimize HSH_Lookup with hsh_rush_match()
  • Rush earlier when we transition to a private object

That's a summary capturing the essential changes, but slightly off.

The req->hash_objhead retirement is not so much about "just using" req->objcore (req->waitinglist's scope is also extended) but rather about driving the rush policy from an "unbusied" objcore.

The hsh_rush_match() function is not here as an optimization, but as part of the removal of waiting list serialization (modulo incompatible variants).

And the new rush "latches" are not so much here to rush earlier but really to nail consistent semantics for OC_F_BUSY (and in that regard other rush paths go away for the same reason).
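
As an illustration of the second point, here is a rough sketch (assumed names and a simplified waiting-list linkage, not the real hsh_rush code) of the rush handing each woken request an objcore reference, so that the request already carries everything it needs when it reembarks a worker:

```c
/*
 * Rough sketch of a rush handing out objcore references; names and the
 * intrusive list are simplified stand-ins, not the real hsh_rush code.
 */
#include <stddef.h>

struct objcore {
	unsigned	refcnt;		/* protected by the objhead lock */
};

struct req {
	struct objcore	*objcore;	/* filled in by the rush */
	struct req	*wl_next;	/* waiting-list linkage (simplified) */
};

/* Called with the objhead lock held; wakes at most max_wakeups requests. */
static void
rush_waiters(struct req **wl_head, struct objcore *oc, unsigned max_wakeups)
{
	struct req *req;

	while (max_wakeups-- > 0 && (req = *wl_head) != NULL) {
		*wl_head = req->wl_next;
		req->wl_next = NULL;
		oc->refcnt++;		/* the reference travels with the request */
		req->objcore = oc;
		/* here the real code reschedules req on a worker */
	}
}
```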

Knowing that a rushed object is guaranteed to lead to a cache hit allows the rush policy to be applied wholesale

I really do not understand how this patch series achieves this, I don't even understand how it can be achieved at all for the general (Vary) case.

It does so as far as compatible variants go, so a Vary: User-Agent header could still trigger severe serialization when coupled with zero TTL and grace. I never claimed otherwise, and this is what my first presentation on the topic was about.

Please check slide 12: https://github.com/varnishcache/varnish-cache/wiki/VDD23Q1#compliance

but in general, do I not understand or is it just not true?

I think you missed it from 0a87e86753c8e725dff545bd91f7c0c7b7bad76f's commit message:

This shifts the infamous waiting list serialization phenomenon to the vary header match.

Rewinding in your comment:

I do not see how we need to refactor so much to achieve these goals, and I would prefer if we first addressed these based on the existing interfaces (with some adjustments, maybe).

In particular, the OC_F_BUSY change looks dubious to me.

Some of the refactoring (like the lookup split in two steps) comes from me feeling uneasy about touching this part, in particular the locking logic. Separating the lookup search from the lookup outcome really helped me (and you seemed to appreciate one of its effects on is_hitmiss handling, but I can no longer find your comment). The rest of the refactoring is me preparing the landscape to land the change.

I can try a patch series without the non-essential refactoring. You found a few things that require the change to be somehow amended anyway (like the overlooked VCF).

Now, regarding the OC_F_BUSY semantics as I understood them, which resulted in changes in this patch series, they are the following:

  • OC_F_BUSY signals that the caching outcome of an objcore is not known yet
  • HSH_Unbusy() clears it on the happy path, when a usable object shows up
  • HSH_Fail() clears it on the failing path, when the fetch fails
  • HSH_Withdraw() clears it when no fetch task will be triggered at all

In the last case, the moment we know we won't trigger a fetch task, there is no reason (besides hurting latency to sell more licenses) to keep the OC_F_BUSY flag. Likewise, when a fetch task fails, the object should be "unbusied" immediately (we just got the cacheability outcome) instead of waiting for the last objcore ref to be dropped.
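
A minimal sketch of these semantics, with assumed names and a stubbed-out rush; the real entry points are HSH_Unbusy(), HSH_Fail() and HSH_Withdraw():

```c
/*
 * Sketch of the three "caching outcome" steps; flag values, names and
 * the rush stub are illustrative, not the actual cache_hash.c code.
 */
#define OC_F_BUSY	(1U << 0)
#define OC_F_FAILED	(1U << 1)

struct objcore {
	unsigned	flags;
};

static void
rush(struct objcore *oc, int wholesale)
{
	(void)oc;
	(void)wholesale;	/* stand-in for waking up parked requests */
}

static void
unbusy(struct objcore *oc)	/* happy path: a usable object showed up */
{
	oc->flags &= ~OC_F_BUSY;
	rush(oc, 1);		/* waiters are guaranteed a hit, rush wholesale */
}

static void
fail(struct objcore *oc)	/* sad path: the fetch failed */
{
	oc->flags |= OC_F_FAILED;
	oc->flags &= ~OC_F_BUSY;
	rush(oc, 0);		/* exponential: one waiter retries the fetch */
}

static void
withdraw(struct objcore *oc)	/* no fetch will happen (synth, purge, ...) */
{
	oc->flags &= ~OC_F_BUSY;
	rush(oc, 0);		/* wake the next waiter so somebody fetches */
}
```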

I hope this answers all of your questions, thank you for your patience!

nigoroll commented 5 months ago

FTR, need to devour this:

(20:07:05) slink: done reading once, I will need to get back to it again. There is still something about your "guaranteed to lead to a cache hit" argument which I do not get, but I am having trouble phrasing it. Maybe this regarding "It does so as far as compatible variants go": How do we know variants are compatible before doing a vary compare? How can we do better than wait for a response and then wake up waiters? Sure, we could check the waiters vary before waking them up, but that moves the burden of the vary check to a single thread vs. multiple, and if your patch does this, I must have overlooked it.
(20:13:20) dridi: also
(20:13:43) dridi: the deferred rush match happens outside of the objhead lock
(20:14:03) dridi: so i think it should remain the way i did it
(20:14:57) dridi: i suppose "guaranteed cache hit" should always be followed by "modulus compatible variant"
(20:15:44) dridi: i understand your confusion
(20:16:04) dridi: i still need to amend the rush match to bail out in the presence of a cat flap
(20:16:41) dridi: to answer your question about doing it better
(20:16:56) dridi: we could have some kind of predictive vary in the form of a hash
(20:17:24) dridi: some precomputed vary outcome
(20:19:43) dridi: still, i wouldn't check it in hsh_rush1() under the oh lock

dridi commented 5 months ago

Yesterday I rewrote the patch series without the offending refactoring, and taking changes in a slightly different order helped better lay out the moving parts.

I tried to avoid confusing wording in the commit log, and tried to be more explicit about semantics, so I hope this will be overall easier to review.

nigoroll commented 5 months ago

FTR, I understand you are waiting for my feedback, but this PR needs a longer attention span and I need to tend to other work for the remainder of this week. Sorry.

dridi commented 5 months ago

My goal was to answer everything before you'd take another dive, so in that sense I succeeded ;)

dridi commented 5 months ago

I added 6 SQUASHME commits to address @mbgrydeland's review and one more to add an assertion.

nigoroll commented 4 months ago

I still did not find the hours which this PR requires for another review round, but I just tried it with one of my test setups and ran into this null pointer deref straight away:

#6  0x000055f99a94b07b in child_signal_handler (s=11, si=0x7f048982eaf0, c=0x7f048982e9c0) at cache/cache_main.c:326
        buf = "Signal 11 (Segmentation fault) received at 0x58 si_code 1", '\000' <repeats 966 times>
        sa = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x0}
        req = 0x7f0476664a60
        a = 0x58 <error: Cannot access memory at address 0x58>
        p = 0x7f0480d903f8 "8\003ـ\004\177"
        info = 0x0
#7  <signal handler called>
No locals.
#8  0x000055f99a93dab2 in HSH_Lookup (req=0x7f0476664a60, ocp=0x7f0480d8f670, bocp=0x7f0480d8f668) at cache/cache_hash.c:435
        wrk = 0x7f0480d90338
        oh = 0x0
        oc = 0x7f0468a03880
        exp_oc = 0x7f0480d8f620
        vr = 0x7f0476664bb8
        exp_t_origin = 0
        busy_found = 1
        vary = 0x7f0480d8f640 "\220\366؀\004\177"
        boc_progress = -1
        xid = 0
        dttl = 0
#9  0x000055f99a95d470 in cnt_lookup (wrk=0x7f0480d90338, req=0x7f0476664a60) at cache/cache_req_fsm.c:578
        oc = 0x7f0468a03880
        busy = 0x0
        lr = HSH_BUSY
        had_objcore = 1
#10 0x000055f99a959d07 in CNT_Request (req=0x7f0476664a60) at cache/cache_req_fsm.c:1192
        ctx = {{magic = 2161702624, syntax = 32516, vclver = 2594048900, method = 22009, vpi = 0x7f0480d903b8, msg = 0x7f0476201010, vsl = 0x5800000000, vcl = 0x7f0476200f60, 
            ws = 0x7f0480d8f710, sp = 0x55f99a95f747 <ses_get_attr+343>, req = 0x7f0480d8f740, http_req = 0x800000094, http_req_top = 0x7f0476200f20, http_resp = 0x128, bo = 0x7f0480d8f730, 
            http_bereq = 0x55f99a960327 <SES_Get_proto_priv+39>, http_beresp = 0x7f0480d8f740, now = 6.8999794284505472e-310, specific = 0x7f0480d8f750, 
            called = 0x55f99a99d16e <http1_getstate+30>}}
        wrk = 0x7f0480d90338
        nxt = REQ_FSM_MORE
#11 0x000055f99a99ce0a in HTTP1_Session (wrk=0x7f0480d90338, req=0x7f0476664a60) at http1/cache_http1_fsm.c:392
        hs = 22009
        sp = 0x7f0476200f20
        st = 0x55f99aa51e5e <H1PROC> "HTTP1::Proc"
        i = 32516
#12 0x000055f99a99c350 in http1_req (wrk=0x7f0480d90338, arg=0x7f0476664a60) at http1/cache_http1_fsm.c:87
        req = 0x7f0476664a60
#13 0x000055f99a98d382 in Pool_Work_Thread (pp=0x7f0482200000, wrk=0x7f0480d90338) at cache/cache_wrk.c:496
        tp = 0x7f0480d8f858
        tpx = {list = {vtqe_next = 0x7f0480d51360, vtqe_prev = 0x7f0482200078}, func = 0x55f99a99c1c0 <http1_req>, priv = 0x7f0476664a60}
        tps = {list = {vtqe_next = 0x0, vtqe_prev = 0x0}, func = 0x55f99a953320 <pool_stat_summ>, priv = 0x7f0482205000}
        tmo = 0
        now = 1707760746.3839386
        i = 5
        reserve = 6
...

(gdb) fr 8
#8  0x000055f99a93dab2 in HSH_Lookup (req=0x7f0476664a60, ocp=0x7f0480d8f670, bocp=0x7f0480d8f668) at cache/cache_hash.c:435
435         boc_progress = oc->boc == NULL ? -1 : oc->boc->fetched_so_far;
dridi commented 4 months ago

It is embarrassing to find that I forgot to query the boc under the oh lock, especially since we discussed it when the construct was first introduced (see https://github.com/varnishcache/varnish-cache/pull/3566#issuecomment-874208586). At the very least we found out quickly, thanks!

Taken care of in https://github.com/varnishcache/varnish-cache/pull/4032/commits/141308d719858842b9b3d527d5c7a280ead5c482, could you please give it another try?
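
For reference, the pattern of that fix looks roughly like this (assumed lock and field names, not the actual diff): sample oc->boc while holding the objhead lock instead of dereferencing it lock-free.

```c
/*
 * Sketch of the fix pattern with assumed names: oc->boc may be released
 * concurrently, so it must be sampled under the objhead lock.
 */
#include <pthread.h>
#include <stdint.h>

struct boc {
	uint64_t	fetched_so_far;
};

struct objcore {
	struct boc	*boc;		/* NULL once the fetch has completed */
};

static int64_t
boc_progress_locked(pthread_mutex_t *oh_mtx, const struct objcore *oc)
{
	int64_t progress;

	pthread_mutex_lock(oh_mtx);
	progress = oc->boc == NULL ? -1 : (int64_t)oc->boc->fetched_so_far;
	pthread_mutex_unlock(oh_mtx);
	return (progress);
}
```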

dridi commented 4 months ago

Thanks to @simonvik, who is helping me get traffic through this pull request, we found that a test case for vmod_saintmode would panic. It quickly became obvious to me that this case, which had slipped through the cracks, actually shows that the model behind this pull request is coherent.

It's the new HSH_Withdraw() building block that solves the problem, and in hindsight it's obvious that an objcore is effectively being withdrawn.

The one-liner fix is in cf932ca620f2a690eb0d2e5470560ef2103e7871 and coverage was added directly to the master branch in 3eeb8c4f5014e2abcbc87292b876f40b0b1f09e8.

nigoroll commented 4 months ago

FYI: For a multitude of reasons I still did not find the few hours I would like to have for quiet work to thoroughly review the new state of this PR. I could spend about half an hour with it, and I currently feel both under- and overwhelmed by it. Underwhelmed, because I still fail to see why the goals of this PR need so many changes (I see this as my own mistake), and I am overwhelmed by the fact that at this point the PR can effectively only be reviewed as one large diff due to the incremental changes.

I am still worried about the change to the busy state model and the general impact it might have. I wonder whether we really need the withdraw facility, and I am not convinced that we actually DTRT in all the different places.

nigoroll commented 4 months ago

Just in case I will be away for more before-release bugwashing: I would really prefer to move this one to the next cycle to get the chance for a thorough review. See previous comments for reasons.

dridi commented 4 months ago

I fixed an incorrect assertion that surfaced from production traffic. It otherwise ran successfully on Friday and over the weekend.

dridi commented 4 months ago

From Friday's bugwash: I will squash all the commits that followed @mbgrydeland's review, close this pull request and open a new one based on this patch series to start a new discussion.

dridi commented 4 months ago

https://github.com/varnishcache/varnish-cache/pull/4073