Do we handle 304 responses correctly?

nigoroll commented 10 months ago

I'm wondering if there aren't some potential problems unrelated to this patch that we have uncovered. The fact that Varnish will enter 304 revalidation logic even though we never asked for it (didn't send out any IMS/INM headers) is worrying to me. I wonder if we shouldn't keep track somehow of whether to expect a 304, and error out if we still got a 304. Could prevent a future gotcha at least.

Originally posted by @mbgrydeland in https://github.com/varnishcache/varnish-cache/issues/4013#issuecomment-1838688414

walid-git commented 10 months ago

We already error out when we get an unexpected 304 response from the backend : https://github.com/varnishcache/varnish-cache/blob/master/bin/varnishd/cache/cache_fetch.c#L375-L378

    if (bo->stale_oc != NULL &&
        ObjCheckFlag(bo->wrk, bo->stale_oc, OF_IMSCAND)) {
        ....
    } else if (!bo->uncacheable) {
        /*
         * Backend sent unallowed 304
         */
        VSLb(bo->vsl, SLT_Error,
            "304 response but not conditional fetch");
        bo->htc->doclose = SC_RX_BAD;
        vbf_cleanup(bo);
        return (-1);
    }

mbgrydeland commented 10 months ago

I guess the discrepancy to investigate is that OF_IMSCAND doesn't track whether we sent IMS/INM on the fetch, only if we could've done that. And especially for zero-length objects, it is possible to not send IMS/INM, but still execute 304 handling if the backend responded with 304 anyways.

dridi commented 10 months ago

I'm of the opinion that if we can't see it this problem does not exist, so may I suggest starting with a new addition to vmod_debug?

.. NB: keep enum in sync with obj_attr.h
$Function INT obj_attr_size($Enum {LEN, VXID, FLAGS, GZIPBITS, LASTMODIFIED, VARY, HEADERS, ESIDATA, GZIPED, CHGCE, IMSCAND, ESIPROC})
$Restrict vcl_deliver, vcl_backend_refresh, vcl_backend_response

Give the size of an attribute or zero if it is missing for:

- ``obj`` in ``vcl_deliver``
- ``obj_stale`` in ``vcl_backend_refresh``
- ``beresp`` in ``vcl_backend_response``

This way it can also be used in boolean expressions. Once we have that we can add more test cases, or more check to existing test cases. And since we want to solve this before #3994 we would initially skip the obj_stale inspection.

Does it make sense?

bsdphk commented 9 months ago

My view is that the IMS header indicates to VCL that conditional fetch is available, but nothing prevents VCL from replacing it with If-None-Match or another "better" conditional, for whatever value of "better" the VCL writer prefers.

The only dubious case that the bereq has no conditionals and a buggy backend sends 304, something I do not see as a high priority for us.

I'm not even convinced that all conditionals, also private ones, are mandated to have "If-*" as a prefix, is that written down somewhere ?

dridi commented 2 months ago

I wen through what the RFCs have to say about 304 in general and caching, and my conclusion is that as of today we are doing things mostly correctly, and probably close to the best we can do with the current management.

One big difference between the way we handle revalidation and what the RFCs have to say is the single stale object we embark in a [cond]fetch task. We're supposed to update the headers of all the stored responses matching a 304, but we only do it with the single stale response. For practical reasons, we should keep doing it.

The RFCs talk about preconditions based on weak and strong validators, but says nothing about preconditions being expressed with header fields prefixed with if-:

Other preconditions, defined by other specifications as extension fields, might place conditions on all recipients, on the state of the target resource in general, or on a group of resources.

We support three preconditions, two of which raising OF_IMSCAND.

Regarding validators, only two are defined by the standard, and those are the two we currently use. last-modified is considered weak by default, but can be treated as strong by the application if the clock resolution is deemed good enough. etag is explicitly marked as weak, or strong by default, and we already have rules to weaken a strong entity tag when we fiddle with a response body.

As of today, with revalidation handled by the core code, I think it is fair game to accept 304s only when we expected them after setting up the preconditions.

Another thing to consider is the request method, and in order to refresh a stored response there needs to be a match there as well. Considering how all cache misses are turned into GET backend requests, we can leave this as an exercise to the VCL author to support revalidation in other contexts. I didn't look at the WebDAV spec, but it is the main well-known extension, plus the built-in VCL currently turns WebDAV traffic into pipe mode (caching WebDAV has always been an exercise left to the VCL author).

What we should consider for #3994 moving forward:

check current preconditions in vcl_backend_refresh
- the built-in VCL "fails" (see below) a refresh in the absence of known preconditions
check stale_obj validity in vcl_backend_refresh
- if the stale template object is no longer usable, "fail" the refresh (see below)
- likewise, by default we only refresh 200 responses
when a refresh logically fails
- unset built-in preconditions (bereq.http.{if-modified-since,if-none-match}
- return (retry (fetch)) as suggested in #4082

Something like this for the built-in:

sub vcl_backend_refresh {
        call vcl_builtin_backend_refresh;
        return (merge);
}

sub vcl_builtin_backend_refresh {
        call vcl_refresh_valid;
        call vcl_refresh_status;
        call vcl_refresh_conditions;
}

sub vcl_refresh_valid {
        if (!obj_stale.is_valid) {
                call vcl_refresh_error;
        }
}

sub vcl_refresh_status {
        if (obj_stale.status != 200) {
                call vcl_refresh_error;
        }
}

sub vcl_refresh_conditions {
        if (!bereq.http.if-modified-since &&
            !bereq.http.if-none-match) {
                call vcl_refresh_error;
        }
}

sub vcl_refresh_error {
        unset bereq.http.if-modified-since;
        unset bereq.http.if-none-match;
        return (retry(fetch));
}

Anyway, as far as this ticket is concerned, I think we are fine with our current 304 handling and we can close it.

dridi commented 2 months ago

Slightly modified built-in to avoid side effects on bereq.retries that could disturb existing VCL:


sub vcl_backend_refresh {
        call vcl_builtin_backend_refresh;
        return (merge);
}

sub vcl_builtin_backend_refresh {
        call vcl_refresh_valid;
        call vcl_refresh_status;
        call vcl_refresh_conditions;
}

sub vcl_refresh_valid {
        if (obj_stale.retried) { # read-only, analogous to bereq.retries, but BOOL
                return (error);
        }
        if (!obj_stale.is_valid) {
                call vcl_refresh_retry;
        }
}

sub vcl_refresh_status {
        if (obj_stale.status != 200) {
                call vcl_refresh_retry;
        }
}

sub vcl_refresh_conditions {
        if (!bereq.http.if-modified-since &&
            !bereq.http.if-none-match) {
                return (error);
        }
}

sub vcl_refresh_retry {
        unset bereq.http.if-modified-since;
        unset bereq.http.if-none-match;
        # Same transition as return (fetch) from vcl_backend_fetch,
        # but turns into an error if obj_stale.retried already true.
        return (fetch);
}

walid-git commented 2 months ago

sub vcl_refresh_conditions {
        if (!bereq.http.if-modified-since &&
            !bereq.http.if-none-match) {
                call vcl_refresh_error;
        }
}

I think this should rather be return (error); because if we got a 304 from the backend when we didn't ask for it, then there is no point in retrying the exact same request another time and expect a different response.

dridi commented 2 months ago

I agree, I will edit my previous comment accordingly.

edit: I also renamed vcl_refresh_error to vcl_refresh_retry

varnishcache / varnish-cache

Do we handle 304 responses correctly? #4026