psf / cachecontrol

The httplib2 caching algorithms packaged up for use with requests.
Other
466 stars 122 forks source link

Head request is not cached even with cacheable_methods=("GET", HEAD") without a previous GET #337

Open Guts opened 1 month ago

Guts commented 1 month ago

Hello,

Thanks for this package, it sounds really useful and fit my needs. I'm trying to use it to improve the Mkdocs RSS plugin when it comes to retrieve a remote image length as expected by the enclosure tag. For now, a HEAD request is tried to read the value from response content-length header. If it fails,

See related code: https://github.com/Guts/mkdocs-rss-plugin/blob/68c62e5b579b408dbc9999b251bb7c13c562cee8/mkdocs_rss_plugin/util.py#L620-L668

Here comes my quick & dirty dev script to test it quickly:

import http.client
import logging
from pathlib import Path

import requests
from cachecontrol import CacheControl
from cachecontrol.caches.file_cache import FileCache

http.client.HTTPConnection.debuglevel = 1
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
req_log = logging.getLogger("requests.packages.urllib3")
req_log.setLevel(logging.DEBUG)
req_log.propagate = True

sess = CacheControl(
    requests.Session(), cache=FileCache(".web_cache"), cacheable_methods=("HEAD", "GET")
)

# head requests
resp_img = sess.head(
    "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png"
)

# try again, cache hit expected
resp_img = sess.head(
    "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png"
)

print(list(Path(".web_cache").iterdir()))

But reading the log, I can see that the cache is not used, nor even stored:

DEBUG:cachecontrol.controller:Looking up "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): cdn.geotribu.fr:443
send: b'HEAD /img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png HTTP/1.1\r\nHost: cdn.geotribu.fr\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Tue, 25 Jun 2024 10:37:50 GMT
header: Server: Apache/2.4.59 (Ubuntu)
header: Strict-Transport-Security: max-age=15552001; includeSubDomains;
header: Vary: User-Agent,Accept-Encoding
header: Upgrade: h2,h2c
header: Connection: Upgrade, Keep-Alive
header: Last-Modified: Mon, 15 May 2023 11:16:04 GMT
header: ETag: "1784e-5fbb994a06b57"
header: Accept-Ranges: bytes
header: Content-Length: 96334
header: Cache-Control: max-age=31536000, public
header: Keep-Alive: timeout=5, max=100
header: Content-Type: image/png
DEBUG:urllib3.connectionpool:https://cdn.geotribu.fr:443 "HEAD /img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png HTTP/11" 200 0
DEBUG:cachecontrol.controller:Looking up "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:cachecontrol.controller:No cache entry available
send: b'HEAD /img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png HTTP/1.1\r\nHost: cdn.geotribu.fr\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Tue, 25 Jun 2024 10:37:50 GMT
header: Server: Apache/2.4.59 (Ubuntu)
header: Strict-Transport-Security: max-age=15552001; includeSubDomains;
header: Vary: User-Agent,Accept-Encoding
header: Last-Modified: Mon, 15 May 2023 11:16:04 GMT
header: ETag: "1784e-5fbb994a06b57"
header: Accept-Ranges: bytes
header: Content-Length: 96334
header: Cache-Control: max-age=31536000, public
header: Keep-Alive: timeout=5, max=99
header: Connection: Keep-Alive
header: Content-Type: image/png
DEBUG:urllib3.connectionpool:https://cdn.geotribu.fr:443 "HEAD /img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png HTTP/11" 200 0
Traceback (most recent call last):
  File "/home/jmo/Git/Perso/mkdocs-rss-plugin/tests/dev/dev_cached_http.py", line 45, in <module>
    print(list(Path(".web_cache").iterdir()))
  File "/usr/lib/python3.10/pathlib.py", line 1017, in iterdir
    for name in self._accessor.listdir(self):
FileNotFoundError: [Errno 2] No such file or directory: '.web_cache'

BUT if I make a GET request to the same resource before the HEAD, the cache is stored AND even read for the HEAD:

import http.client
import logging
from pathlib import Path

import requests
from cachecontrol import CacheControl
from cachecontrol.caches.file_cache import FileCache

http.client.HTTPConnection.debuglevel = 1
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
req_log = logging.getLogger("requests.packages.urllib3")
req_log.setLevel(logging.DEBUG)
req_log.propagate = True

sess = CacheControl(
    requests.Session(), cache=FileCache(".web_cache"), cacheable_methods=("HEAD", "GET")
)

# get requests
resp = sess.get("https://geotribu.fr")
resp_img = sess.get(
    "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png"
)

# try again, cache hit expected
resp = sess.get("https://geotribu.fr")
resp_img = sess.get(
    "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png"
)

# head requests
resp_img = sess.head(
    "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png"
)

# try again, cache hit expected
resp_img = sess.head(
    "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png"
)

print(list(Path(".web_cache").iterdir()))

Logs:

DEBUG:cachecontrol.controller:Looking up "https://geotribu.fr/" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): geotribu.fr:443
send: b'GET / HTTP/1.1\r\nHost: geotribu.fr\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Connection: keep-alive
header: Content-Length: 36931
header: Server: GitHub.com
header: Content-Type: text/html; charset=utf-8
header: Last-Modified: Mon, 24 Jun 2024 17:15:31 GMT
header: Access-Control-Allow-Origin: *
header: ETag: W/"6679a9b3-1c64a"
header: expires: Tue, 25 Jun 2024 05:36:20 GMT
header: Cache-Control: max-age=600
header: Content-Encoding: gzip
header: x-proxy-cache: MISS
header: X-GitHub-Request-Id: 9410:182824:28A2A85:2986177:667A54FB
header: Accept-Ranges: bytes
header: Age: 311
header: Date: Tue, 25 Jun 2024 10:42:09 GMT
header: Via: 1.1 varnish
header: X-Served-By: cache-par-lfpg1960074-PAR
header: X-Cache: HIT
header: X-Cache-Hits: 0
header: X-Timer: S1719312129.348531,VS0,VE1
header: Vary: Accept-Encoding
header: X-Fastly-Request-ID: 4197c8a81421f5a2aed12f7abecd529eb36a01bb
DEBUG:urllib3.connectionpool:https://geotribu.fr:443 "GET / HTTP/11" 200 36931
DEBUG:cachecontrol.controller:Updating cache with response from "https://geotribu.fr/"
DEBUG:cachecontrol.controller:etag object cached for 1209600 seconds
DEBUG:cachecontrol.controller:Caching due to etag
DEBUG:filelock:Attempting to acquire lock 135385395530768 on .web_cache/a/e/3/4/8/ae3480509cd61ad7727667c13727f5909a71b094423dffa3730b3ab8.lock
DEBUG:filelock:Lock 135385395530768 acquired on .web_cache/a/e/3/4/8/ae3480509cd61ad7727667c13727f5909a71b094423dffa3730b3ab8.lock
DEBUG:filelock:Attempting to release lock 135385395530768 on .web_cache/a/e/3/4/8/ae3480509cd61ad7727667c13727f5909a71b094423dffa3730b3ab8.lock
DEBUG:filelock:Lock 135385395530768 released on .web_cache/a/e/3/4/8/ae3480509cd61ad7727667c13727f5909a71b094423dffa3730b3ab8.lock
DEBUG:cachecontrol.controller:Looking up "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): cdn.geotribu.fr:443
send: b'GET /img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png HTTP/1.1\r\nHost: cdn.geotribu.fr\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Tue, 25 Jun 2024 10:42:09 GMT
header: Server: Apache/2.4.59 (Ubuntu)
header: Strict-Transport-Security: max-age=15552001; includeSubDomains;
header: Vary: User-Agent,Accept-Encoding
header: Upgrade: h2,h2c
header: Connection: Upgrade, Keep-Alive
header: Last-Modified: Mon, 15 May 2023 11:16:04 GMT
header: ETag: "1784e-5fbb994a06b57"
header: Accept-Ranges: bytes
header: Content-Length: 96334
header: Cache-Control: max-age=31536000, public
header: Keep-Alive: timeout=5, max=100
header: Content-Type: image/png
DEBUG:urllib3.connectionpool:https://cdn.geotribu.fr:443 "GET /img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png HTTP/11" 200 96334
DEBUG:cachecontrol.controller:Updating cache with response from "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png"
DEBUG:cachecontrol.controller:etag object cached for 1209600 seconds
DEBUG:cachecontrol.controller:Caching due to etag
DEBUG:filelock:Attempting to acquire lock 135385395533696 on .web_cache/8/d/2/2/c/8d22c12b7f7b5166ec3853b07d4b8dc0cb5c13a8eba0e19e34e4fb70.lock
DEBUG:filelock:Lock 135385395533696 acquired on .web_cache/8/d/2/2/c/8d22c12b7f7b5166ec3853b07d4b8dc0cb5c13a8eba0e19e34e4fb70.lock
DEBUG:filelock:Attempting to release lock 135385395533696 on .web_cache/8/d/2/2/c/8d22c12b7f7b5166ec3853b07d4b8dc0cb5c13a8eba0e19e34e4fb70.lock
DEBUG:filelock:Lock 135385395533696 released on .web_cache/8/d/2/2/c/8d22c12b7f7b5166ec3853b07d4b8dc0cb5c13a8eba0e19e34e4fb70.lock
DEBUG:cachecontrol.controller:Looking up "https://geotribu.fr/" in the cache
DEBUG:cachecontrol.controller:Current age based on date: 0
DEBUG:cachecontrol.controller:Freshness lifetime from max-age: 600
DEBUG:cachecontrol.controller:The response is "fresh", returning cached response
DEBUG:cachecontrol.controller:600 > 0
DEBUG:cachecontrol.controller:Looking up "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png" in the cache
DEBUG:cachecontrol.controller:Current age based on date: 0
DEBUG:cachecontrol.controller:Freshness lifetime from max-age: 31536000
DEBUG:cachecontrol.controller:The response is "fresh", returning cached response
DEBUG:cachecontrol.controller:31536000 > 0
DEBUG:cachecontrol.controller:Looking up "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png" in the cache
DEBUG:cachecontrol.controller:Current age based on date: 0
DEBUG:cachecontrol.controller:Freshness lifetime from max-age: 31536000
DEBUG:cachecontrol.controller:The response is "fresh", returning cached response
DEBUG:cachecontrol.controller:31536000 > 0
DEBUG:cachecontrol.controller:Looking up "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png" in the cache
DEBUG:cachecontrol.controller:Current age based on date: 0
DEBUG:cachecontrol.controller:Freshness lifetime from max-age: 31536000
DEBUG:cachecontrol.controller:The response is "fresh", returning cached response
DEBUG:cachecontrol.controller:31536000 > 0
[PosixPath('.web_cache/8'), PosixPath('.web_cache/a')]

Possibly related issue: https://github.com/psf/cachecontrol/issues/216

woodruffw commented 1 month ago

Thanks for the report @Guts!

To make sure I'm understanding: #216 looks related, but is describing different behavior (HEAD clobbers the GET cache). Can you confirm that you're seeing the same thing, in addition to the behavior around HEAD-only sequences not being cached?

Guts commented 1 month ago

To make sure I'm understanding: https://github.com/psf/cachecontrol/issues/216 looks related, but is describing different behavior (HEAD clobbers the GET cache). Can you confirm that you're seeing the same thing, in addition to the behavior around HEAD-only sequences not being cached?

@woodruffw Not sure, I should have written "possibly related issue". I've edited.

Guts commented 1 month ago

Any chance to see it fixed @woodruffw? Sorry, I cant' help on this since it's far away from my skills. I just need if it's something planned or not, to adapt my proper plans consequently.

woodruffw commented 1 month ago

I'll try and find the time to make a fix for this over the coming weekend, but I can't make any promises, sorry. If this is an urgent behavioral change for you, I'd suggest working around it for now 🙂

woodruffw commented 1 month ago

I took a quick look at this, and I think it's unfortunately going to be non-trivial to implement:

  1. CacheControlAdapter.send is the entrypoint into the cache; it calls cached_request(request) to retrieve the response if the HTTP method is marked as cacheable;
  2. cached_request is (seemingly) agnostic to the HTTP method itself: it pulls from the cache based on a canonicalized URL.

To perform a general fix here, we'd probably need to update the cache keying logic to treat HEAD and GET as separate-but-cascading keys, i.e. hitting HEAD if present and then falling back to GET if not independently cached. But this would require a substantial refactor of the existing controller/adapter, and I'm not sure if it's worth it (given that the value of caching a lightweight method like HEAD are marginal for most users, and AFAICT other middleware does not typically cache it).

Taking a step back: could you share a bit more about your use case? You mentioned that you're performing HEAD to get conten-length of an image ahead of time, but is there an architectural reason why you need to issue HEAD multiple times (and expect it to be cached)?

woodruffw commented 1 month ago

2. cached_request is (seemingly) agnostic to the HTTP method itself: it pulls from the cache based on a canonicalized URL.

To highlight this underlying architectural decision, here's an example of how CacheControl leaks the cached GET's body into a subsequent HEAD response:

import requests

from cachecontrol import CacheControl
from cachecontrol.cache import DictCache

sess = CacheControl(
    requests.Session(), cache=DictCache(), cacheable_methods=("HEAD", "GET")
)

# misses the cache
resp1 = sess.head("https://example.com")

assert resp1.request.method == "HEAD"
assert not resp1.content

# primes the cache
resp2 = sess.get("https://example.com")

assert resp2.request.method == "GET"
assert resp2.content

# hits the cache from the previous GET
resp3 = sess.head("https://example.com")

assert resp3.request.method == "HEAD"
# fails because the cached GET is returned as HEAD
assert not resp3.content, resp3.content

This is arguably incorrect, since a HEAD response should never contain a message-body (RFC 2616 9.4). But this flaw has probably been present for quite a while, at least when people enable HEAD caching 🙂

TL;DR: I think that cacheable_methods might be an API mis-feature within CacheControl, one that has never fully worked correctly (or rather does something resembling correctness, but not consistent with the HTTP/1.1 RFC).