scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

RFC2616 cache policy checks max-age against wrong date after revalidation #3778

Open jameysharp opened 5 years ago

jameysharp commented 5 years ago

Expected behavior

A Cache-Control: max-age=60 request header should prevent requests to the same URL from being sent more frequently than once per minute. If spiders generate requests for that URL more often, then they should be satisfied from the cache without contacting the origin server.

(For simplicity, I'm assuming that either the origin server hasn't set a shorter max-age and hasn't made the response uncacheable, or HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS and HTTPCACHE_ALWAYS_STORE are set to ignore those restrictions.)
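
For concreteness, a minimal sketch of the scenario, assuming HTTPCACHE_ENABLED = True and HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy" in the project settings (the URL and spider name are placeholders):

```python
import scrapy

class CachedSpider(scrapy.Spider):
    name = "cached"  # placeholder spider name

    def start_requests(self):
        # Repeated requests for the same URL. With the cache middleware
        # enabled, at most one request per minute should hit the origin;
        # the rest should be answered from the cache.
        for _ in range(10):
            yield scrapy.Request(
                "https://example.com/resource",
                headers={"Cache-Control": "max-age=60"},
                dont_filter=True,  # bypass the duplicate-request filter
            )

    def parse(self, response):
        self.logger.info("got %s", response.url)
```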

Actual behavior

The header prevents sending requests for the first minute, but after that every request results in contacting the origin server for revalidation.

Root cause

RFC 7234 says:

A response's age is the time that has passed since it was generated by, or successfully validated with, the origin server.

RFC2616Policy._compute_current_age computes the wrong age when an Age header is present or after at least one 304 response has been received.

As a result, Scrapy does not correctly interpret the Cache-Control: max-age=<seconds> request header.

Proposed fixes

Fix handling of Age response header

RFC2616Policy._compute_current_age computes results in one of two different time bases, depending on whether the Age header is present. Without the Age header, it returns the age relative to now. However, the Age header only provides the age relative to when the response was received, which will usually be quite different.

Fixing this requires finding out when the response was stored in the cache. Since the existing cache storage implementations maintain a timestamp or <key>_time metadata field, existing caches already have that information; it just isn't currently exposed to the cache policy.
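
For reference, a sketch of the full calculation from RFC 7234 section 4.2.3, assuming the policy were handed a hypothetical stored_at epoch timestamp by the storage backend (and simplifying away the request/response transit correction, which would need per-request timing):

```python
from email.utils import mktime_tz, parsedate_tz

def httpdate_to_epoch(value: bytes) -> int:
    # Parse an RFC 1123 date such as b"Tue, 07 May 2019 00:00:00 GMT".
    return mktime_tz(parsedate_tz(value.decode("ascii")))

def compute_current_age(response, stored_at: float, now: float) -> float:
    # Age at the moment the response entered this cache: the larger of
    # the Age header and the transit delay implied by the Date header.
    date = httpdate_to_epoch(response.headers[b"Date"])
    apparent_age = max(0, stored_at - date)
    try:
        age_value = int(response.headers.get(b"Age", b"0"))
    except ValueError:
        age_value = 0
    corrected_initial_age = max(apparent_age, age_value)
    # Plus however long it has been resident in the cache since then.
    return corrected_initial_age + (now - stored_at)
```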

Freshen cached response on successful validation

The RFC2616Policy implementations of should_cache_response and is_cached_response_valid ensure that a 304 response never affects what the spider sees or the cache contents. However, that violates the requirements of RFC 7234 4.3.4:

When a cache receives a 304 (Not Modified) response and already has one or more stored 200 (OK) responses for the same cache key, the cache needs to identify which of the stored responses are updated by this new response and then update the stored response(s) with the new information provided in the 304 response.

Since Scrapy only keeps one version of the response, and uses that version to generate its conditional GET request, identifying which response to update is trivial. Then,

...the cache MUST:

  • delete any Warning header fields in the stored response with warn-code 1xx (see Section 5.5);

  • retain any Warning header fields in the stored response with warn-code 2xx; and,

  • use other header fields provided in the 304 (Not Modified) response to replace all instances of the corresponding header fields in the stored response.

In addition, since we need to use the response timestamp to compute the age of the response, that has to get updated to the current time as well. (This has the nice side benefit that the DBM and LevelDB storage backends won't expire responses which have been successfully validated recently.)
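
A hedged sketch of what that freshening could look like, using Scrapy's Headers API; the helper itself and how it would be wired into the policy are hypothetical:

```python
def freshen_stored_response(cachedresponse, response_304):
    # RFC 7234 4.3.4: drop stored Warning headers with warn-code 1xx,
    # keep those with warn-code 2xx.
    warnings = cachedresponse.headers.getlist(b"Warning")
    if warnings:
        del cachedresponse.headers[b"Warning"]
        for warning in warnings:
            if not warning.lstrip().startswith(b"1"):
                cachedresponse.headers.appendlist(b"Warning", warning)
    # Replace corresponding stored header fields with the 304's fields.
    for name in response_304.headers.keys():
        if name.lower() != b"warning":
            cachedresponse.headers[name] = response_304.headers.getlist(name)
    # The caller would then re-store cachedresponse, refreshing the
    # backend's timestamp metadata to "now" at the same time.
    return cachedresponse
```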

Inappropriate use of now

RFC2616Policy._compute_current_age shouldn't use now as a default date (in ... .get(b'Date')) or now) because that method is invoked long after the initial request was submitted, when the cached response is being checked for usability for a later request. now just isn't a meaningful substitute for the timestamp when the response was generated; the age computation can only be correct if there's a Date header in the response.

Fortunately, HttpCacheMiddleware.process_response already adds a Date header if one is not present, and does so before calling self.storage.store_response, so we can rely on it being present in all cached responses.

So the or now fallback should just be dropped because it's unreachable; even if it were reachable, it would be wrong.
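
Paraphrasing the relevant line of RFC2616Policy._compute_current_age and the proposed change:

```python
# Current behaviour, paraphrased: "now" is the time of the *later* cache
# lookup, not of the original exchange, so it is meaningless as a Date.
date = rfc1123_to_epoch(response.headers.get(b"Date")) or now

# Proposed: rely on Date unconditionally, since HttpCacheMiddleware
# guarantees it is present before the response is stored.
date = rfc1123_to_epoch(response.headers[b"Date"])
```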

(However, I'd think setting a default Date header shouldn't rely on enabling the cache middleware. If this were done somewhere globally, I think that would resolve #2221.)

jameysharp commented 5 years ago

It occurs to me that the test conditions can be made a little more general. A request header of Cache-Control: max-age=0, max-stale=60 should keep repeated requests from being validated for 60 seconds regardless of what max-age or Expires the original response had. (But other response directives could still force re-sending the request unless they're filtered with HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS.)

mossplix commented 5 years ago

@Gallaecio Should I go ahead and work on this or are there any more pending discussions?

Gallaecio commented 5 years ago

I think all the points @jameysharp raises make sense, so I think this is worth presenting a pull request. Even if a discussion were needed, a pull request can only enrich a discussion.

jameysharp commented 5 years ago

Thanks for looking into this, both of you!

The issue I'd thought would be hardest to tackle was handling the Age header correctly, because I thought the cache storage API would need to change.

But I want to propose a simpler fix than my first thought. How about just deleting the code which looks at the Age header? I'd be interested to hear if anyone has reasons why this header is important, but here's why I think it'd be okay for Scrapy to just ignore it:

We only need an estimate of the resource's freshness which is accurate enough to not change the interpretation of cache controls like max-age too much. If the origin server's clock is set to an older time than clients' clocks, clients which rely on the Date header will revalidate sooner than expected, resulting in more traffic than the server wanted.

But how much more is limited to (clock bias / max-age). For example, if the origin server's clock is 15 seconds behind all the clients' clocks, then for a resource with max-age set to five minutes, it'll see no more than 5% more requests than expected. If the clocks are biased in the other direction, the client won't see changes for, at worst, 5% longer than intended.
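
The bound from the previous paragraph, worked through:

```python
clock_bias = 15     # origin server clock 15 seconds behind the clients
max_age = 5 * 60    # resource max-age of five minutes, in seconds
extra = clock_bias / max_age
print(f"at most {extra:.0%} more requests than expected")  # -> 5%
```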

I'd guess that, in practice, clock biases are small relative to typical max-age values, so both effects hardly matter.

Therefore I think it's reasonable to rely solely on the Date header.

The only case where I think it still might be worth paying attention to Age is if there is no Date header. This can only come up if neither the origin server nor any caching proxy along the way added a Date header, but at least one proxy along the route served a cached response without validating it and set Age accordingly.

So when Scrapy synthesizes a Date response header because there wasn't one already, I guess it ought to use time() - max(0, int(headers.get("Age", 0))).
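
As a sketch (the helper name is hypothetical; headers is a Scrapy Headers object):

```python
import time

def synthesized_date_epoch(headers) -> int:
    # No Date header present: back-date by the proxy-reported Age, if
    # any, so the synthesized Date approximates the generation time.
    try:
        age = int(headers.get(b"Age", b"0"))
    except (TypeError, ValueError):
        age = 0
    return int(time.time()) - max(0, age)
```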

But I'd guess that's such a rare occurrence that it hardly matters, and if an origin server wants its cache control directives interpreted correctly then it darn well should get its clock set correctly.

jameysharp commented 5 years ago

I think now I have a pretty good idea what patches I would write to fix these issues. @mossplix, have you made any progress tackling this? I'd be delighted to answer questions or otherwise help.

mossplix commented 5 years ago

Hi @jameysharp, I was waiting to see how the discussion pans out. Let me jump on it. I will def shoot you questions.

mossplix commented 5 years ago

Section 13.2.3 of RFC 2616 actually specifies the age calculation. Maybe we need to stick to the spec?

jameysharp commented 5 years ago

@mossplix I'm going to assume that's substantially the same as RFC7234 section 4.2.3, which updated the caching parts of RFC2616.

In my third comment (https://github.com/scrapy/scrapy/issues/3778#issuecomment-493211897) I explained why I don't think it's worthwhile to implement the full age calculation. I suspect (but have not proven) that the Age header rarely occurs in the settings where Scrapy is used. If my guess is correct, then anything handling that header will rarely be tested and is likely to be broken (as it is now). If it doesn't happen very often, I'd personally prefer to delete it until someone comes up with a use case where it matters to them. I can think of several ways to fix it if you really want to, though.

That said, the part of this issue which has the most impact is that Scrapy does not "freshen cached response on successful validation." If you skipped the other two parts entirely and only fixed that, then that would be a huge improvement.