jschneier / django-storages

https://django-storages.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
2.7k stars 847 forks source link

s3: open()/read() does too many HEAD requests #1407

Open craigds opened 1 month ago

craigds commented 1 month ago

We've noticed that using S3Storage.open("file.x").read() does a lot of HEAD requests in addition to the GET:

"HEAD /file.x HTTP/1.1" 200 0
"HEAD /file.x HTTP/1.1" 200 0
"GET /file.x HTTP/1.1" 200 123

These are caused by:

When called in a tight loop these extra requests can slow things down a fair bit, especially for large numbers of small files.

I propose:

  1. Eliminate the request in the constructor by just hitting self.file (thus triggering the download_fileobj right away.). Probably most callers will be calling .read() immediately anyway. Add a config option (EAGER_DOWNLOAD?) to opt out if you really don't want to, but I don't see any common reason you wouldn't - If you don't want to read the file but just want object size or something, you don't need to call S3Storage.open() at all, you can use S3Storage.size()
  2. Eliminate the request in the download_fileobj by using get() instead of download_fileobj. This will probably be context-dependent (for larger files, download_fileobj may perform better), so it probably needs to be opt-in via a setting - what about USE_MULTIPART_DOWNLOAD?

Thanks for your consideration :)

craigds commented 1 month ago

to be clear i'm happy to submit a PR to implement this change :) just wanted to solicit some feedback on the ideas first

jschneier commented 1 month ago

Thanks for opening this, people also pay for these requests so best to minimize.

I strongly want to avoid adding settings where possible.

For option 1, would we still get an exception if you try to read a file that doesn't exist? As long as we maintain that invariant I think that is certainly the best way.

Am happy to accept a PR for this!