jameysharp opened this issue 5 years ago
It occurs to me that the test conditions can be made a little more general. A request header of `Cache-Control: max-age=0, max-stale=60` should keep repeated requests from being validated for 60 seconds regardless of what `max-age` or `Expires` the original response had. (But other response directives could still force re-sending the request unless they're filtered with `HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS`.)
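A minimal sketch of what that looks like from a spider, assuming the RFC2616 cache policy is enabled (the URL is a placeholder):

```python
from scrapy import Request

# Sketch: ask the cache for a response that may be up to 60 seconds stale,
# so repeated requests within that window should not be revalidated.
req = Request(
    "https://example.com/some/page",  # placeholder URL
    headers={"Cache-Control": "max-age=0, max-stale=60"},
)
```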
@Gallaecio Should I go ahead and work on this or are there any more pending discussions?
I think all the points @jameysharp raises make sense, so this is worth presenting as a pull request. Even if further discussion were needed, a pull request can only enrich it.
Thanks for looking into this, both of you!
The issue I'd thought would be hardest to tackle was handling the `Age` header correctly, because I thought the cache storage API would need to change.
But I want to propose a simpler fix than my first thought. How about just deleting the code which looks at the `Age` header? I'd be interested to hear if anyone has reasons why this header is important, but here's why I think it'd be okay for Scrapy to just ignore it:
We only need an estimate of the resource's freshness which is accurate enough to not change the interpretation of cache controls like `max-age` too much. If the origin server's clock is set to an older time than the clients' clocks, clients which rely on the `Date` header will revalidate sooner than expected, resulting in more traffic than the server wanted. But how much more is limited to (clock bias / max-age). For example, if the origin server's clock is 15 seconds behind all the clients' clocks, then for a resource with `max-age` set to five minutes, it'll see no more than 5% more requests than expected. If the clocks are biased in the other direction, the client won't see changes for, at worst, 5% longer than intended.
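To make that bound concrete (just restating the numbers from the example above):

```python
# Extra revalidation traffic caused by clock bias is roughly bounded by
# clock_bias / max_age.
clock_bias = 15         # origin clock is 15 seconds behind the clients
max_age = 5 * 60        # freshness lifetime of five minutes, in seconds
extra_fraction = clock_bias / max_age  # 0.05, i.e. at most ~5% more requests
```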
I'd guess that, in practice, origin servers' clocks are rarely off by enough to matter compared to typical `max-age` values. Therefore I think it's reasonable to rely solely on the `Date` header.
The only case where I think it still might be worth paying attention to `Age` is if there is no `Date` header. This can only come up if neither the origin server nor any caching proxy along the way added a `Date` header, but at least one proxy along the route served a cached response without validating it and set `Age` accordingly.
So when Scrapy synthesizes a `Date` response header because there wasn't one already, I guess it ought to use `time() - max(0, int(headers.get("Age", 0)))`.

But I'd guess that's such a rare occurrence that it hardly matters, and if an origin server wants its cache control directives interpreted correctly then it darn well should get its clock set correctly.
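A sketch of that suggestion, as a helper that could be called where HttpCacheMiddleware.process_response currently fills in a missing `Date` header (the helper name is made up):

```python
from time import time
from email.utils import formatdate


def ensure_date_header(headers):
    """Synthesize a Date header, backdated by any Age header
    (hypothetical helper, not existing Scrapy code)."""
    if b"Date" not in headers:
        try:
            age = max(0, int(headers.get(b"Age") or b"0"))
        except ValueError:
            age = 0
        headers[b"Date"] = formatdate(time() - age, usegmt=True)
```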
I think now I have a pretty good idea what patches I would write to fix these issues. @mossplix, have you made any progress tackling this? I'd be delighted to answer questions or otherwise help.
Hi @jameysharp, I was waiting to see how the discussion panned out. Let me jump on it. I will definitely shoot you questions.
Section 13.2.3 of RFC 2616 actually specifies the age calculation. Maybe we need to stick to the spec?
@mossplix I'm going to assume that's substantially the same as RFC7234 section 4.2.3, which updated the caching parts of RFC2616.
In my third comment (https://github.com/scrapy/scrapy/issues/3778#issuecomment-493211897) I explained why I don't think it's worthwhile to implement the full age calculation. I suspect (but have not proven) that the `Age` header rarely occurs in the settings where Scrapy is used. If my guess is correct, then anything handling that header will rarely be tested and is likely to be broken (like it is now). If it doesn't happen very often, I'd personally prefer to delete it until someone comes up with a use case where it matters to them. I can think of several ways to fix it if you really want to, though.
That said, the part of this issue which has the most impact is that Scrapy does not "freshen cached response on successful validation." If you skipped the other two parts entirely and only fixed that, then that would be a huge improvement.
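For reference, the full current-age calculation from RFC 7234 section 4.2.3 that the earlier comments refer to looks roughly like this; request_time and response_time are when the request was sent and the response received, which is exactly the kind of information the storage backends would have to expose:

```python
def current_age(age_value, date_value, request_time, response_time, now):
    """Current age of a stored response, per RFC 7234 section 4.2.3.
    All arguments are seconds since the epoch, except age_value, which is
    the value of the Age header (0 if absent)."""
    apparent_age = max(0, response_time - date_value)
    response_delay = response_time - request_time
    corrected_age_value = age_value + response_delay
    corrected_initial_age = max(apparent_age, corrected_age_value)
    resident_time = now - response_time
    return corrected_initial_age + resident_time
```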
Expected behavior
A `Cache-Control: max-age=60` request header should prevent requests to the same URL from being sent more frequently than once per minute. If spiders generate requests for that URL more often, then they should be satisfied from the cache without contacting the origin server. (For simplicity, I'm assuming that either the origin server hasn't set a shorter `max-age` and hasn't made the response uncacheable, or `HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS` and `HTTPCACHE_ALWAYS_STORE` are set to ignore those restrictions.)

Actual behavior
The header prevents sending requests for the first minute, but after that every request results in contacting the origin server for revalidation.
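A sketch of a setup that should show this behaviour (the setting names are real Scrapy settings; the values and the request are only illustrative):

```python
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS = ["no-cache", "must-revalidate"]
HTTPCACHE_ALWAYS_STORE = True

# In a spider callback, re-request the same URL more often than once a minute:
#   yield scrapy.Request(url, headers={"Cache-Control": "max-age=60"},
#                        dont_filter=True)
```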
Root cause
RFC 7234 defines the `max-age` request directive in terms of the stored response's current age, and its section 4.2.3 specifies how that age must be computed.

`RFC2616Policy._compute_current_age` computes the wrong age when an `Age` header is present or after at least one 304 response has been received. As a result, Scrapy does not correctly interpret the `Cache-Control: max-age=<seconds>` request header.

Proposed fixes
Fix handling of the `Age` response header

`RFC2616Policy._compute_current_age` computes its result in one of two different time-bases, depending on whether the `Age` header is present. Without the `Age` header, it returns age relative to now. However, the `Age` header only provides age relative to when the response was received, which will usually be quite different.

Fixing this requires finding out when the response was stored in the cache. Since the existing cache storage implementations maintain a `timestamp` or `<key>_time` metadata field, existing caches already have that information; it just isn't currently exposed to the cache policy.
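A sketch of what the fixed computation could look like once the policy can see that timestamp (`stored_at` is a hypothetical parameter standing in for the existing `timestamp`/`<key>_time` metadata):

```python
def compute_current_age(date_value, age_value, stored_at, now):
    """Hypothetical replacement for RFC2616Policy._compute_current_age.
    date_value, stored_at and now are seconds since the epoch; age_value
    is the Age header value (0 if absent)."""
    # Age at the moment the response was stored, from the Date header...
    initial_age = max(0, stored_at - date_value)
    # ...or from the Age header, which is also relative to that moment.
    initial_age = max(initial_age, age_value)
    # Then add the time the response has spent in our cache since.
    return initial_age + (now - stored_at)
```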
Freshen cached response on successful validation

The `RFC2616Policy` implementations of `should_cache_response` and `is_cached_response_valid` ensure that a 304 response never affects what the spider sees or the cache contents. However, that violates the requirements of RFC 7234 section 4.3.4, which requires a cache that receives a 304 to update the stored response with the header fields from the 304 and reuse it.

Since Scrapy only keeps one version of the response, and used that version to generate its conditional `GET` request, identifying which one to update is trivial. Then the 304's headers just need to be merged into the cached response, which is written back to the cache and returned to the spider (see the sketch below).

In addition, since we need to use the response timestamp to compute the age of the response, that has to get updated to the current time as well. (This has the nice side benefit that the DBM and LevelDB storage backends won't expire responses which have been successfully validated recently.)
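A sketch of the freshening step (the helper and variable names are made up; `store_response` is the existing cache-storage method):

```python
def freshen_cached_response(cachedresponse, validation_response, storage, request, spider):
    """On a 304, fold the new header fields into the cached response, write
    it back so its stored timestamp is reset, and return it for the spider."""
    for name in validation_response.headers.keys():
        cachedresponse.headers[name] = validation_response.headers.getlist(name)
    storage.store_response(spider, request, cachedresponse)
    return cachedresponse
```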
Inappropriate use of `now`

`RFC2616Policy._compute_current_age` shouldn't use `now` as a default date (in `... .get(b'Date')) or now`) because that method is invoked long after the initial request was submitted, when the cached response is being checked for usability for a later request. `now` just isn't a meaningful substitute for the timestamp when the response was generated; the age computation can only be correct if there's a `Date` header in the response.

Fortunately, `HttpCacheMiddleware.process_response` already adds a `Date` header if one is not present, and does so before calling `self.storage.store_response`, so we can rely on it being present in all cached responses. So the `or now` should just get dropped because it's unreachable, and even if it were reachable, it's also wrong.

(However, I'd think setting a default `Date` header shouldn't rely on enabling the cache middleware. If this were done somewhere globally, I think that would resolve #2221.)
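A sketch of that parenthetical idea, as a stand-alone downloader middleware (not existing Scrapy code) that could be enabled independently of the cache:

```python
from email.utils import formatdate


class EnsureDateHeaderMiddleware:
    """Hypothetical downloader middleware: give every response a Date header
    so downstream consumers (including the cache policy) can rely on it."""

    def process_response(self, request, response, spider):
        if b"Date" not in response.headers:
            response.headers[b"Date"] = formatdate(usegmt=True)
        return response
```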