Open Guts opened 1 month ago
Thanks for the report @Guts!
To make sure I'm understanding: https://github.com/psf/cachecontrol/issues/216 looks related, but is describing different behavior (HEAD clobbers the GET cache). Can you confirm that you're seeing the same thing, in addition to the behavior around HEAD-only sequences not being cached?
@woodruffw Not sure, I should have written "possibly related issue". I've edited.
Any chance of seeing this fixed, @woodruffw? Sorry, I can't help on this since it's far beyond my skills. I just need to know whether it's planned or not, so I can adapt my own plans accordingly.
I'll try and find the time to make a fix for this over the coming weekend, but I can't make any promises, sorry. If this is an urgent behavioral change for you, I'd suggest working around it for now 🙂
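For reference, one possible application-level workaround is to remember the Content-Length per URL yourself, outside of CacheControl. This is only a sketch under assumptions: `fetch_content_length` and `ContentLengthMemo` are hypothetical names, the memo lives in memory for a single run, and it ignores HTTP freshness rules (Cache-Control, ETag) entirely.

```python
import urllib.request
from typing import Callable, Dict, Optional


def fetch_content_length(url: str) -> Optional[int]:
    """Issue one HEAD request and parse Content-Length, if the server sends it."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None else None


class ContentLengthMemo:
    """Hypothetical helper: remembers one Content-Length per URL for this run."""

    def __init__(
        self, fetch: Callable[[str], Optional[int]] = fetch_content_length
    ) -> None:
        # The fetcher is injectable so the memo can be exercised without a network.
        self._fetch = fetch
        self._seen: Dict[str, Optional[int]] = {}

    def get(self, url: str) -> Optional[int]:
        # Only the first lookup for a URL reaches the fetcher.
        if url not in self._seen:
            self._seen[url] = self._fetch(url)
        return self._seen[url]
```

Calling `ContentLengthMemo().get(url)` twice for the same URL should issue a single HEAD request. Whether that is acceptable depends on how stale a length you can tolerate.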
I took a quick look at this, and I think it's unfortunately going to be non-trivial to implement:
1. CacheControlAdapter.send is the entrypoint into the cache; it calls cached_request(request) to retrieve the response if the HTTP method is marked as cacheable;
2. cached_request is (seemingly) agnostic to the HTTP method itself: it pulls from the cache based on a canonicalized URL.
To perform a general fix here, we'd probably need to update the cache keying logic to treat HEAD and GET as separate-but-cascading keys, i.e. hitting HEAD if present and then falling back to GET if not independently cached. But this would require a substantial refactor of the existing controller/adapter, and I'm not sure if it's worth it (given that the value of caching a lightweight method like HEAD is marginal for most users, and AFAICT other middleware does not typically cache it).
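To make the "separate-but-cascading keys" idea concrete, here is a rough sketch of what such a lookup could look like. It is a toy in-memory cache, not CacheControl code: `MethodAwareCache` and the `(method, URL)` key shape are assumptions made for illustration, and a real change would have to live in the controller/adapter.

```python
from typing import Dict, Optional, Tuple

# Hypothetical key shape: (method, canonical URL). Per the comment above,
# the existing lookup is keyed on the canonicalized URL only.
CacheKey = Tuple[str, str]


class MethodAwareCache:
    """Toy in-memory cache keyed on (method, URL), for illustration only."""

    def __init__(self) -> None:
        self._store: Dict[CacheKey, bytes] = {}

    def set(self, method: str, url: str, value: bytes) -> None:
        self._store[(method.upper(), url)] = value

    def get(self, method: str, url: str) -> Optional[Tuple[str, bytes]]:
        """Return (method the entry was stored under, value), or None on a miss."""
        method = method.upper()
        # Try the exact method first; only a HEAD lookup cascades to GET.
        candidates = [method] + (["GET"] if method == "HEAD" else [])
        for candidate in candidates:
            value = self._store.get((candidate, url))
            if value is not None:
                return candidate, value
        return None
```

With this keying, a HEAD-only sequence gets its own entry, a GET never reads a HEAD entry, and a HEAD falls back to a cached GET only when no HEAD entry exists. Returning the stored method lets the caller know when it must drop the body before answering a HEAD.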
Taking a step back: could you share a bit more about your use case? You mentioned that you're performing HEAD to get the content-length of an image ahead of time, but is there an architectural reason why you need to issue HEAD multiple times (and expect it to be cached)?
> 2. cached_request is (seemingly) agnostic to the HTTP method itself: it pulls from the cache based on a canonicalized URL.
To highlight this underlying architectural decision, here's an example of how CacheControl leaks the cached GET's body into a subsequent HEAD response:
    import requests
    from cachecontrol import CacheControl
    from cachecontrol.cache import DictCache

    sess = CacheControl(
        requests.Session(), cache=DictCache(), cacheable_methods=("HEAD", "GET")
    )

    # misses the cache
    resp1 = sess.head("https://example.com")
    assert resp1.request.method == "HEAD"
    assert not resp1.content

    # primes the cache
    resp2 = sess.get("https://example.com")
    assert resp2.request.method == "GET"
    assert resp2.content

    # hits the cache from the previous GET
    resp3 = sess.head("https://example.com")
    assert resp3.request.method == "HEAD"
    # fails because the cached GET is returned as HEAD
    assert not resp3.content, resp3.content
This is arguably incorrect, since a HEAD response should never contain a message-body (RFC 2616 9.4). But this flaw has probably been present for quite a while, at least when people enable HEAD caching 🙂
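For illustration, answering a HEAD from a cached GET entry without leaking the body could look roughly like the sketch below. `CachedResponse` and `as_head_response` are hypothetical stand-ins, not CacheControl types or functions; the only point is that the headers survive while the body is dropped.

```python
from dataclasses import dataclass, field, replace
from typing import Dict


@dataclass(frozen=True)
class CachedResponse:
    """Hypothetical stand-in for a cached response; not a CacheControl type."""

    status: int
    headers: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""


def as_head_response(cached: CachedResponse) -> CachedResponse:
    """Return a copy of a cached GET response suitable for answering HEAD."""
    # Headers (including Content-Length) are copied over unchanged;
    # only the message body is dropped. The cached entry is not modified.
    return replace(cached, headers=dict(cached.headers), body=b"")
```

Content-Length still describes the body a GET would have returned, which is what a HEAD caller typically wants to read.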
TL;DR: I think that cacheable_methods might be an API mis-feature within CacheControl, one that has never fully worked correctly (or rather does something resembling correctness, but not consistent with the HTTP/1.1 RFC).
Hello,
Thanks for this package, it sounds really useful and fits my needs. I'm trying to use it to improve the Mkdocs RSS plugin when it comes to retrieving a remote image's length, as expected by the enclosure tag. For now, a HEAD request is tried to read the value from the response's content-length header. If it fails,
See related code: https://github.com/Guts/mkdocs-rss-plugin/blob/68c62e5b579b408dbc9999b251bb7c13c562cee8/mkdocs_rss_plugin/util.py#L620-L668
Here comes my quick & dirty dev script to test it quickly:
But reading the log, I can see that the cache is neither read nor written:
BUT if I make a GET request to the same resource before the HEAD, the cache is stored AND even read for the HEAD:
Logs:
Possibly related issue: https://github.com/psf/cachecontrol/issues/216