psf / cachecontrol

The httplib2 caching algorithms packaged up for use with requests.

Investigate failure to load caches above `2^32 - 1` bytes #336

Open woodruffw opened 3 weeks ago

woodruffw commented 3 weeks ago

Opening this as a reminder to myself.

This is likely related to #238 and #200: some recent torch wheels are >= 2.5 GB, and pip appears to download them repeatedly without hitting the cache. My only SWAG so far is that the body itself overflows msgpack's signed 32-bit limit on binary objects, per the spec.

Haven't fully diagnosed yet.

See: https://news.ycombinator.com/item?id=40659973
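
For reference, a minimal sketch of how the msgpack format sizes its binary-object headers, to make the suspected limit concrete. It assumes the `bin 8` / `bin 16` / `bin 32` layout from the msgpack spec, where the length field is an unsigned big-endian integer; `bin_header` and `wheel_size` are hypothetical names for illustration and are not part of cachecontrol or msgpack-python.

```python
import struct

INT32_MAX = 2**31 - 1   # largest signed 32-bit value
UINT32_MAX = 2**32 - 1  # largest length a bin 32 header can carry


def bin_header(length: int) -> bytes:
    """Return the msgpack header for a binary object of `length` bytes."""
    if length < 0:
        raise ValueError("length must be non-negative")
    if length <= 0xFF:
        return struct.pack(">BB", 0xC4, length)  # bin 8: 1-byte length
    if length <= 0xFFFF:
        return struct.pack(">BH", 0xC5, length)  # bin 16: 2-byte length
    if length <= UINT32_MAX:
        return struct.pack(">BI", 0xC6, length)  # bin 32: 4-byte length
    raise OverflowError("too large for a single msgpack bin object")


wheel_size = 2_500_000_000  # hypothetical 2.5 GB response body

print("exceeds signed 32-bit max:", wheel_size > INT32_MAX)
print("fits in a bin 32 header:  ", wheel_size <= UINT32_MAX)
print("header bytes:", bin_header(wheel_size).hex())
```

Under that reading, a 2.5 GB body is past the signed 32-bit maximum but still fits in a `bin 32` header, so an overflow would have to come from an implementation treating the length as signed rather than from the format itself.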

woodruffw commented 3 weeks ago

Looked some more into this: the person who reported this said that torch was serving 2 GB+ wheels, but I can't see any: https://pypi.org/project/torch/#files

That being said, I suspect this is still causing unnecessary cache misses due to #200: we end up storing large downloads (such as 700 MB torch wheels) that never get "hit", since the default msgpack load behavior is to limit binary bodies to ~100 MB: https://msgpack-python.readthedocs.io/en/latest/api.html

woodruffw commented 3 weeks ago

Hmm, I've still been unable to trigger this: it looks like `msgpack.loads(payload)` sets its maximum limits based on `len(payload)`, so we should never really hit a binary object limit in practice.
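
A small sketch of one way to check that behavior locally. It assumes msgpack-python 1.0 or later is installed (a third-party package), uses a tiny 1 KiB body rather than a real wheel, and `check_limits` is a hypothetical helper, not something from cachecontrol. It only exercises `packb`, `loads`, and the `max_bin_len` keyword.

```python
import msgpack  # third-party: pip install msgpack


def check_limits(body_size: int = 1024) -> tuple:
    """Return (default_ok, rejected) for a binary body of `body_size` bytes."""
    body = b"x" * body_size
    payload = msgpack.packb({"body": body}, use_bin_type=True)

    # With no explicit limits, loads() should round-trip the body.
    default_ok = msgpack.loads(payload, raw=False)["body"] == body

    # An explicit max_bin_len below the body size should be refused.
    try:
        msgpack.loads(payload, raw=False, max_bin_len=body_size - 1)
        rejected = False
    except ValueError:
        rejected = True

    return default_ok, rejected


default_ok, rejected = check_limits()
print(f"default limits round-trip: {default_ok}")
print(f"explicit lower limit rejected: {rejected}")
```

If both values come back true, a binary-object limit only bites when a caller passes one explicitly (or uses a streaming unpacker with its own buffer limit), which matches the observation above. It does not say anything about multi-gigabyte bodies, which this sketch does not attempt to allocate.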