s3: open()/read() does too many HEAD requests

craigds commented 1 month ago

We've noticed that using S3Storage.open("file.x").read() does a lot of HEAD requests in addition to the GET:

"HEAD /file.x HTTP/1.1" 200 0
"HEAD /file.x HTTP/1.1" 200 0
"GET /file.x HTTP/1.1" 200 123

These are caused by:

1. The first HEAD request happens in the S3File constructor. Opening a file doesn't necessarily mean you're going to read it, so a HEAD request is issued to ensure the file actually exists.
2. The second HEAD request is caused by boto3, because we use download_fileobj. That method is a high-level managed transfer: it issues a HEAD request to find out the file size so it can potentially do a multipart download of large files using multiple threads.

When called in a tight loop these extra requests can slow things down a fair bit, especially for large numbers of small files.

I propose:

1. Eliminate the request in the constructor by just hitting self.file, thus triggering the download_fileobj right away. Most callers will probably be calling .read() immediately anyway. Add a config option (EAGER_DOWNLOAD?) to opt out if you really don't want this, but I don't see any common reason you wouldn't. If you don't want to read the file but just want the object size or something, you don't need to call S3Storage.open() at all; you can use S3Storage.size().
2. Eliminate the request in download_fileobj by using get() instead. This will probably be context-dependent (for larger files, download_fileobj may perform better), so it probably needs to be opt-in via a setting. What about USE_MULTIPART_DOWNLOAD?
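
A rough sketch of what the two changes could look like together. This is not the django-storages code: EagerS3File and max_memory_size are hypothetical names, and obj is assumed to behave like a boto3 s3.Object, whose get() returns a dict with a readable "Body" stream.

```python
import tempfile


class EagerS3File:
    """Hypothetical stand-in for S3File, for illustration only."""

    def __init__(self, obj, max_memory_size=1024 * 1024):
        # Assumption: `obj` behaves like a boto3 s3.Object, so obj.get()
        # returns a dict whose "Body" is a stream with a read(amt) method.
        self._file = tempfile.SpooledTemporaryFile(max_size=max_memory_size)
        # One GET, issued eagerly in the constructor. There is no HEAD to
        # check existence and no managed transfer probing the object size.
        body = obj.get()["Body"]
        for chunk in iter(lambda: body.read(64 * 1024), b""):
            self._file.write(chunk)
        self._file.seek(0)

    def read(self, size=-1):
        # Served from the local spooled copy; no further requests.
        return self._file.read(size)

    def close(self):
        self._file.close()
```

The trade-off is the one mentioned above: a single streamed GET gives up the multithreaded multipart download, so this mainly helps with large numbers of small files.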

Thanks for your consideration :)

craigds commented 1 month ago

To be clear, I'm happy to submit a PR to implement this change :) I just wanted to solicit some feedback on the ideas first.

jschneier commented 1 month ago

Thanks for opening this. People also pay for these requests, so it's best to minimize them.

I strongly want to avoid adding settings where possible.

For option 1, would we still get an exception if you try to read a file that doesn't exist? As long as we maintain that invariant I think that is certainly the best way.
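
To illustrate the invariant, here is a minimal sketch of how a missing key could still surface as an exception when the object is fetched with a single GET. It assumes obj behaves like a boto3 s3.Object and that a missing key is reported as a botocore ClientError with error code "NoSuchKey". The helper name get_body_or_raise and the choice of FileNotFoundError are assumptions for illustration, not the backend's confirmed behaviour.

```python
try:
    from botocore.exceptions import ClientError
except ImportError:
    # Fallback so the sketch runs without botocore installed. It mirrors
    # only the parts of ClientError that are used below.
    class ClientError(Exception):
        def __init__(self, error_response, operation_name):
            super().__init__(f"{operation_name}: {error_response}")
            self.response = error_response
            self.operation_name = operation_name


def get_body_or_raise(obj, name):
    """Hypothetical helper: fetch the object body with one GET."""
    try:
        return obj.get()["Body"]
    except ClientError as exc:
        code = exc.response.get("Error", {}).get("Code")
        if code == "NoSuchKey":
            # The GET itself reports the missing key, so no separate HEAD
            # is needed to keep "missing file raises" true.
            raise FileNotFoundError(f"File does not exist: {name}") from exc
        # Anything else (permissions, throttling, ...) propagates unchanged.
        raise
```

With an eager download the error would be raised from open() rather than from the first read(), which matches when the current HEAD check fails.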

Am happy to accept a PR for this!

jschneier / django-storages

s3: open()/read() does too many HEAD requests #1407