psf / cachecontrol

The httplib2 caching algorithms packaged up for use with requests.

CacheControl should cache any cacheable response #32

Closed jaraco closed 10 years ago

jaraco commented 10 years ago

According to the RFC, "By default, a response is cacheable if the requirements of the request method, request header fields, and the response status indicate that it is cacheable."

For this reason, and also because it's the behavior of other clients such as browsers, I would expect a simple 200 response to a simple GET request to be cached by CacheControl if there are no headers limiting the caching. CacheControl should cache these indefinitely. There should not need to be any flag such as in #18 to invoke this behavior.

ionrock commented 10 years ago

In issue #18 the idea was to cache responses that were specifically not cacheable. It sounds like you are saying CacheControl should cache any GET request, regardless of whether or not it has any headers signifying it should be cached (i.e. cache-control, ETag, etc.). Section 13.4 does say a caching system MAY cache any response, but I don't believe that makes sense for CacheControl by default. From a user standpoint, you would have to ensure you provided something like a no-cache cache control header for any request after the initial request in order to avoid hitting the cache. The other concern is that caching every successful request could result in a rather large cache. If the user is using the in-memory cache, this could quickly become cumbersome.

With that said, CacheControl does cache ETag responses indefinitely, which is different than httplib2's behavior.

Let me know if I'm misunderstanding what you mean. As I understand the RFC there needs to be some header that provides some information that is used by a cache in order to be sure it is cacheable. If you have responses that are meant to be cached that are not cached, then maybe we are missing something.

jaraco commented 10 years ago

> headers signifying it should be cached

> As I understand the RFC there needs to be some header that provides some information that is used by a cache in order to be sure it is cacheable.

From my perspective, the RFC does not provide for such headers. It provides for headers that limit caching and specifically say that the default is that a response is cacheable (as referenced in the original post). Where in the RFC do you see that some header is necessary to be sure the response is cacheable?

To back my understanding, if you observe how your browser behaves, you'll see that it will cache such responses. Consider this resource: http://xkcd.com/1/info.0.json

It returns very few headers and no cache-relevant headers. Firefox will cache that response. In this case, Firefox will cache the response and expire it after 130 minutes. Note that you have to load the resource by simply supplying the URL. If you use the refresh button (or similar), that will bypass the local cache. The user agent is allowed to make whatever optimizations it wants around caching. The user agent is responsible for keeping the cache efficient (i.e. not letting it over-consume resources).

What I find surprising is that CacheControl will only cache responses that have limits imposed on the caching (via headers), but not those that have no limits.

It would be reasonable for Cache-Control to do something similar to Firefox and apply a default expiration or to otherwise limit the size of the cache (perhaps using a LRU or LFU model for expulsion), but refusing to cache eminently cacheable responses is not a good basis for limiting resource usage.

ionrock commented 10 years ago

From Section 13.4:

If there is neither a cache validator nor an explicit expiration time associated with a response, we do not expect it to be cached, but certain caches MAY violate this expectation (for example, when little or no network connectivity is available).

This suggests that assuming the specific cache system has chosen not to cache all requests (Section 13.4 says "a caching system MAY always store a successful response"), there shouldn't be an expectation a response will be cached unless there is some cache validator or expiration time. In the case of CacheControl, we do not choose to cache all successful requests by default, therefore, we do require some indication that caching should be involved by way of an ETag and/or some sort of time based caching header.
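The rule described here (cache only when a validator or an explicit expiration time is present) can be sketched as a small check. This is an illustration of the quoted Section 13.4 text, not CacheControl's actual code; the function name `has_validator_or_expiry` is hypothetical, and treating `Last-Modified` as a validator is an assumption taken from the RFC rather than from the library.

```python
def has_validator_or_expiry(headers):
    """Return True if the response headers carry a cache validator
    (ETag, Last-Modified) or an explicit expiration (Expires, max-age)."""
    # Header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    if any(name in lowered for name in ("etag", "last-modified", "expires")):
        return True

    # Look for an explicit lifetime among the Cache-Control directives.
    directives = [
        directive.strip().lower()
        for directive in lowered.get("cache-control", "").split(",")
    ]
    return any(d.startswith(("max-age=", "s-maxage=")) for d in directives)
```

Under this sketch, a response with only `Content-Type` would not be stored, while one with an `ETag` or `Cache-Control: max-age=60` would.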

> To back my understanding, if you observe how your browser behaves, you'll see that it will cache such responses. Consider this resource: http://xkcd.com/1/info.0.json

Here is what I'm seeing for the response for that URL:

>>> from pprint import pprint
>>> import requests
>>> resp = requests.get('http://xkcd.com/1/info.0.json')
>>> pprint(dict(resp.headers))
{'accept-ranges': 'bytes',
 'content-length': '431',
 'content-type': 'application/json; charset=utf-8',
 'date': 'Mon, 21 Jul 2014 16:24:24 GMT',
 'etag': '"1378720664"',
 'last-modified': 'Mon, 21 Jul 2014 04:00:06 GMT',
 'server': 'lighttpd/1.4.28'}

In this case CacheControl should cache this response because there is an ETag.

> It would be reasonable for Cache-Control to do something similar to Firefox and apply a default expiration or to otherwise limit the size of the cache (perhaps using a LRU or LFU model for expulsion), but refusing to cache eminently cacheable responses is not a good basis for limiting resource usage.

Seeing as CacheControl does not make an effort to provide cache storage for every user's needs, adding a specific cache expiration strategy is outside the scope of CacheControl's defaults. Similarly, as CacheControl is a library, not an application like Firefox, my assumption is to leave optimization for a user's needs to the user. I think we all know devs who have been bitten by overly aggressive caches that have put a bad taste in their mouths regarding caching ;)

Now, with that said, I'm not opposed to adding different caching strategies to CacheControl that are not the default. Adding a strategy that allows users to always cache in a similar way that the browser does seems like a helpful tool. I've thought about this a little bit in the past but haven't had a need myself.

If I were going to implement something like this, I would start by adding to the transport adapter some hooks to add headers to the response automatically that will trigger the caching. For example, if there is no ETag, we might take a hash of the content and headers and add a CacheControl-ETag header that would trigger the cache. The other thing I would do is to add a caching strategy to the storage layer. This is less generic as some tools provide their own expiration mechanisms. A LRU option for the FileCache might be a reasonable addition. Finally, I would create a new class to use these extensions. Maybe something like a BrowserCacheControl object that you could use in place of the normal CacheControl object when you want caching to act like a browser.
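The content-hash idea could look roughly like the sketch below. It is only an illustration: the helper names `synthetic_etag` and `add_synthetic_etag` are hypothetical, the choice of SHA-256 is an assumption, and skipping the `Date` header (which changes on every response) is a design guess rather than anything stated in the thread. Only the `CacheControl-ETag` header name comes from the comment above.

```python
import hashlib


def synthetic_etag(headers, body, skip=("date",)):
    """Build a quoted ETag-like value from the headers and body bytes."""
    digest = hashlib.sha256()
    # Sort header names so the hash does not depend on header order.
    for name in sorted(headers, key=str.lower):
        if name.lower() in skip:
            continue  # volatile headers would make every response unique
        digest.update(f"{name.lower()}:{headers[name]}\n".encode("utf-8"))
    digest.update(body)
    return '"' + digest.hexdigest() + '"'


def add_synthetic_etag(headers, body):
    """Return a copy of headers with CacheControl-ETag added when the
    server did not send an ETag of its own."""
    updated = dict(headers)
    if "etag" not in {name.lower() for name in headers}:
        updated["CacheControl-ETag"] = synthetic_etag(headers, body)
    return updated
```

A hash like this only identifies content the client has already seen; the server knows nothing about it, so it cannot be used for revalidation the way a real ETag can.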

My caveat here is that I haven't tried to implement this before and my assumptions on how to do it could be incorrect.
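For the storage-layer idea, a bounded least-recently-used cache can be sketched in a few lines. This is an in-memory illustration rather than the FileCache variant mentioned above; the class name `LRUDictCache` and the `max_entries` parameter are hypothetical, and the `get`/`set`/`delete` method shape is an assumption about what a cache backend would need to expose.

```python
from collections import OrderedDict


class LRUDictCache:
    """In-memory cache that evicts the least recently used entry once
    more than max_entries items are stored."""

    def __init__(self, max_entries=128):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        # A read counts as a use, so move the key to the "newest" end.
        self._data.move_to_end(key)
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        # Drop entries from the "oldest" end until we are within bounds.
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)

    def delete(self, key):
        self._data.pop(key, None)
```

This sketch is not thread-safe; a real backend shared between sessions would likely need a lock around each method.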

ionrock commented 10 years ago

Looking a bit further at the clarified spec it looks like there is a section on calculating freshness heuristics that could be used to create a more browser-like cache strategy. This idea of an Expires Heuristic provides a good name for configuring CacheControl with some means of providing support like a typical browser.
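As a rough sketch of what such a heuristic computes: the HTTP caching specs suggest, as a typical setting, a freshness lifetime of about 10% of the interval between `Date` and `Last-Modified`. The function below illustrates that calculation only; the name `heuristic_freshness` and the `divisor` parameter are hypothetical, and real clients such as Firefox may use different rules and caps.

```python
from email.utils import parsedate_to_datetime


def heuristic_freshness(headers, divisor=10):
    """Return a heuristic freshness lifetime as a timedelta, or None when
    the headers do not allow one to be calculated."""
    lowered = {name.lower(): value for name, value in headers.items()}
    if "date" not in lowered or "last-modified" not in lowered:
        return None

    date = parsedate_to_datetime(lowered["date"])
    last_modified = parsedate_to_datetime(lowered["last-modified"])

    # Time the resource had gone unchanged when the response was generated.
    unchanged_for = date - last_modified
    if unchanged_for.total_seconds() <= 0:
        return None

    # With divisor=10 this is the commonly cited 10% rule.
    return unchanged_for / divisor
```

For the xkcd headers shown earlier in the thread, `Date` and `Last-Modified` are 12 hours, 24 minutes and 18 seconds apart, so this sketch yields a lifetime of roughly 74 minutes.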

jaraco commented 10 years ago

> In this case CacheControl should cache this response because there is an ETag.

You're right. I've made a mistake about the example. The reason I thought it was not being cached was because subsequent requests were taking longer than initial requests. I'll provide more detail and analysis later.

Thanks for helping me think this through.

jaraco commented 10 years ago

After much more consideration, I do believe that CacheControl should offer some flexibility on determining which objects to cache.

The use case I have above does get cached, but because the item isn't considered "fresh", it gets validated with every request, so provides minimal benefit over not caching at all.
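The validation round trip described here is a conditional request: the client sends the stored validators back and the server answers 304 Not Modified if nothing changed. A minimal sketch of how those request headers are derived from a cached response follows; the helper name `conditional_headers` is hypothetical and this is not CacheControl's own code.

```python
def conditional_headers(cached_headers):
    """Build the request headers used to revalidate a cached response."""
    lowered = {name.lower(): value for name, value in cached_headers.items()}
    request_headers = {}
    if "etag" in lowered:
        # The server compares this against the resource's current ETag.
        request_headers["If-None-Match"] = lowered["etag"]
    if "last-modified" in lowered:
        request_headers["If-Modified-Since"] = lowered["last-modified"]
    return request_headers
```

Even when the server replies 304 with an empty body, the request still costs a full network round trip, which is why a cached but never fresh response provides little benefit.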

I've started looking at the code to devise a way to allow hooks for a strategy to be supplied, but I'm getting stuck. So I'm just going to leave this in your court as a feature request, and in the meantime not use CacheControl for my use case.

ionrock commented 10 years ago

@jaraco I've added the idea of a Heuristic in the above commit that you can use to adjust the response headers. It is very basic and doesn't make any effort to be correct, but I think it does let you sidestep the confirmation request used with an ETag. It also uses Warning headers in the response to signify the response might be stale. There are some docs in the docs directory to help get started. This is in the expires-heuristics branch.

Keep me posted if this works for you.

ionrock commented 10 years ago

Here is how you can cache each response forever (more or less).

# Note: released versions of CacheControl expose this module as
# cachecontrol.heuristics (plural).
from cachecontrol.heuristic import BaseHeuristic

class CacheForever(BaseHeuristic):
    def update_headers(self, response):
        # Return only the headers to override on the response.
        headers = {}
        headers['expires'] = 'Sun, 17-Jan-2038 19:14:07 GMT'
        headers['cache-control'] = 'public'
        return headers

Then to use the heuristic, we pass it to the CacheControl constructor:

from cachecontrol import CacheControl
from requests import Session

sess = CacheControl(Session(), heuristic=CacheForever())

I realize that there is a decent chance that the specific heuristic might be rather tricky to get right in some situations. As others give this method a try, I'm happy to include helpers that might be valuable. For example, the docs contain an example using date parsing and formatting from the email.utils. This sort of functionality seems like it could be made available so as to avoid everyone reimplementing how to properly write date headers.
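A date helper of the kind described might look like the sketch below, which takes a response's `Date` header and returns an `Expires` value a fixed interval later. The function name `expires_after` is hypothetical and is not an existing CacheControl helper; it uses only `email.utils` and `datetime` from the standard library.

```python
from datetime import timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def expires_after(date_header, delta):
    """Return an HTTP date string that is `delta` later than the given
    Date header value."""
    moment = parsedate_to_datetime(date_header)
    if moment.tzinfo is None:
        # Some inputs parse as naive datetimes; HTTP dates are always GMT.
        moment = moment.replace(tzinfo=timezone.utc)
    expires = moment.astimezone(timezone.utc) + delta
    # usegmt=True renders the zone as "GMT", as HTTP dates require.
    return format_datetime(expires, usegmt=True)
```

For example, `expires_after('Mon, 21 Jul 2014 16:24:24 GMT', timedelta(days=1))` gives `'Tue, 22 Jul 2014 16:24:24 GMT'`, which a heuristic's `update_headers` could return as the `expires` value.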

@jaraco, if you get a chance to try this out, let me know how it works out.