jschneier / django-storages

https://django-storages.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Reintroduction of AWS_PRELOAD_METADATA functionality #1373

Open · jasongi opened this issue 7 months ago

jasongi commented 7 months ago

A few years ago the AWS_PRELOAD_METADATA setting was removed in this PR: https://github.com/jschneier/django-storages/pull/636

I understand why - this is not something you'd want to turn on when servicing web requests and was probably the source of many complaints. But it was crucial to the performance optimisations that collectfast made. I've recently forked collectfast into Collectfasta, as collectfast had been archived, but I've had to reimplement a lot of the S3Storage methods to add back the preloading functionality in order to keep the performance improvements. This isn't sustainable, as the code will need to be kept in sync with each django-storages release.

My proposal: we move this logic back into django-storages (I am happy to contribute) but with some safeguards to ensure it only runs during management commands. Some ideas on this below (not all need to be implemented):

  1. Only allow it for static files storage classes - not other classes. This would be somewhat difficult as all the logic lives in the base classes, so it would require a bit of refactoring of the base class to work. It also doesn't prevent it from running while serving requests.
  2. Do not allow it to be set through settings at all - within the management command we can manipulate the storage object directly. Kinda feels icky doing this, though.
  3. Use an ENV var rather than Django settings - management commands can set the env var directly, or users can set it when running the command. The reason I suggest this is that you shouldn't need a separate settings module just for running collectstatic (there's a rough sketch of this option below the list).
  4. Detect the command that is being run somehow, perhaps by checking sys.argv - I doubt this works with call_command, so it's probably not my first choice. As far as I know there's no functionality in Django to tell, within an arbitrary piece of code (i.e. where you don't have a request object), whether you're serving requests or running a management command.
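
To make option 3 concrete, here is a minimal sketch of what the env-var gate could look like. The class name, the DJANGO_STORAGES_PRELOAD_METADATA variable and the entries cache are assumptions of mine, not an existing django-storages API:

```python
import os

from storages.backends.s3 import S3Storage


class PreloadingS3Storage(S3Storage):
    """Sketch of option 3: preloading stays off unless an environment
    variable is set, so it can't silently stay enabled while serving
    requests. The class and variable names here are hypothetical."""

    def __init__(self, **settings):
        super().__init__(**settings)
        self.preload_metadata = (
            os.environ.get("DJANGO_STORAGES_PRELOAD_METADATA") == "1"
        )
        self._entries = None

    @property
    def entries(self):
        """Lazily list the bucket once when preloading is enabled."""
        if not self.preload_metadata:
            return None
        if self._entries is None:
            self._entries = {
                obj.key: obj
                for obj in self.bucket.objects.filter(Prefix=self.location)
            }
        return self._entries
```

A management command (or Collectfasta) would then run with `DJANGO_STORAGES_PRELOAD_METADATA=1` set, and normal web workers would never build the cache.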

If you want to read about why collectstatic performance is so bad, I have a write-up on it here. Unfortunately, the way the core Django storage API works (in particular ManifestStaticFilesStorage) makes it very difficult to write a performant collectstatic command with just the methods available in the Storage interface if your storage class deals with remote storage.

davidmir commented 6 months ago

I'm experiencing super slow times with collectstatic to S3. A solution would be nice since it costs $.

jschneier commented 6 months ago

I will say it was an oversight of mine not to search GH to see how many consumers of the preload metadata API existed before I removed it.

And similarly, you understand why it was removed.

That being said I am happy to work on a solution that improves this for people since it seems there is still significant demand. In general, back when I was working with Django, I had switched to Whitenoise, but I do understand that doesn't work for everyone.

Can you propose a new API that would meet all the needs? It's possible we could upstream it once it's been developed out, and since this library has BaseStorage it might be okay to do something general.

jasongi commented 3 months ago

I have had a read through the collectstatic and HashedFilesMixin code. I think adding the following methods to the Storage class would let us refactor for performance, provided we can also refactor HashedFilesMixin and the collectstatic command to use them (initially this could be done via Collectfasta).

exists_bulk: to be called in HashedFilesMixin._post_process. For S3 this can list the bucket, map the results to a dictionary by filename, then look up each key. We can cap this at a configurable number of files and fall back to HeadObject to avoid major performance issues if it is run on a big bucket.
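
For concreteness, here is a rough sketch of what exists_bulk could look like on the S3 backend; the BulkS3Storage subclass and the 10,000-object cap are assumptions of mine, not a settled API:

```python
from storages.backends.s3 import S3Storage
from storages.utils import clean_name


class BulkS3Storage(S3Storage):
    # Hypothetical safeguard: past this many objects, a full listing stops
    # being cheaper than individual HeadObject calls.
    max_bulk_listing = 10_000

    def exists_bulk(self, names):
        """Return {name: bool} for many names from a single bucket listing,
        falling back to per-object HeadObject (self.exists) on big buckets."""
        keys = set()
        for i, obj in enumerate(self.bucket.objects.filter(Prefix=self.location)):
            if i >= self.max_bulk_listing:
                return {name: self.exists(name) for name in names}
            keys.add(obj.key)
        return {
            name: self._normalize_name(clean_name(name)) in keys
            for name in names
        }
```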

delete_bulk: to be called in HashedFilesMixin._post_process. For S3 this can use the DeleteObjects call, which deletes up to 1000 objects per request.
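
A matching sketch of delete_bulk on the same hypothetical subclass:

```python
from storages.backends.s3 import S3Storage
from storages.utils import clean_name


class BulkS3Storage(S3Storage):
    def delete_bulk(self, names):
        """Delete many objects at once; DeleteObjects accepts at most
        1000 keys per request, so batch accordingly."""
        keys = [{"Key": self._normalize_name(clean_name(name))} for name in names]
        for start in range(0, len(keys), 1000):
            self.bucket.delete_objects(Delete={"Objects": keys[start:start + 1000]})
```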

save_changed_files: to be used in collectstatic and HashedFilesMixin._post_process. This should hash the local and remote files and only save a file if the remote copy has changed. For S3 we can do this using the ETag, which is what Collectfasta currently uses; we can also optionally gzip the content before hashing if the bucket has gzip enabled, which is also what Collectfasta currently does.
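
And a sketch of the ETag-comparison idea, again on the hypothetical subclass. For the hashes to ever match, the compression here would have to mirror exactly what the backend does when it uploads gzipped content:

```python
import gzip
import hashlib

from storages.backends.s3 import S3Storage
from storages.utils import clean_name


class BulkS3Storage(S3Storage):
    def save_changed_files(self, files):
        """Save each (name, content) pair only if it differs from the remote
        copy, judged by comparing a local MD5 against the remote ETag (valid
        for plain, non-multipart, non-KMS uploads)."""
        # One listing gives us every remote ETag up front.
        remote_etags = {
            obj.key: obj.e_tag.strip('"')
            for obj in self.bucket.objects.filter(Prefix=self.location)
        }
        saved = []
        for name, content in files:
            data = content.read()
            content.seek(0)
            if self.gzip:
                # mtime=0 keeps the gzip output deterministic; a real
                # implementation must mirror the backend's own compression
                # settings or the hashes will never line up.
                data = gzip.compress(data, mtime=0)
            key = self._normalize_name(clean_name(name))
            if remote_etags.get(key) != hashlib.md5(data).hexdigest():
                self.save(name, content)
                saved.append(name)
        return saved
```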

The basic strategy would be to collect a list of all the files that HashedFilesMixin calls self._save, self.exists and self.delete on, and defer those calls to the bulk methods at the end of post_process so they can be done in one batch (a partial sketch follows).
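
As a partial sketch of that deferral, here is just the exists() half: a combined manifest + S3 storage (hypothetical class name) that answers exists() from one up-front listing during post_process. Deferring delete() and _save() needs more care, since the mixin deletes a stale hashed file right before re-saving it:

```python
from django.contrib.staticfiles.storage import ManifestFilesMixin
from storages.backends.s3 import S3Storage
from storages.utils import clean_name


class PreloadingManifestS3Storage(ManifestFilesMixin, S3Storage):
    """Sketch: one ListObjects pass replaces the HeadObject call that
    HashedFilesMixin would otherwise make per hashed file."""

    _known_keys = None

    def post_process(self, paths, dry_run=False, **options):
        self._known_keys = {
            obj.key for obj in self.bucket.objects.filter(Prefix=self.location)
        }
        try:
            yield from super().post_process(paths, dry_run=dry_run, **options)
        finally:
            self._known_keys = None

    def exists(self, name):
        if self._known_keys is None:
            return super().exists(name)
        # Files saved during this run won't be in the cache; with S3's
        # default file_overwrite=True that only skips a redundant delete.
        return self._normalize_name(clean_name(name)) in self._known_keys
```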

For collectstatic, we need to collect all the files, send them to save_changed_files, and let the storage decide whether a file needs deletion, rather than calling self.delete_file and then self.storage.save one file at a time. Basically this means removing the "handler" methods (link_file and copy_file) and shifting that logic into the new save_changed_files storage method (rough sketch below).
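
On the collectstatic side, the override might look roughly like this (Collectfasta already does something along these lines); save_changed_files is the hypothetical bulk method sketched above, so this only works against a storage that provides it:

```python
from django.contrib.staticfiles.management.commands import collectstatic
from django.core.files.base import ContentFile


class Command(collectstatic.Command):
    """Rough sketch: queue every file during collection instead of copying
    one at a time, then let the storage decide what actually changed."""

    def collect(self):
        self._queued = []
        result = super().collect()
        if not self.dry_run and self._queued:
            # Hypothetical bulk API from the storage sketches above.
            self.storage.save_changed_files(self._queued)
        return result

    def copy_file(self, path, prefixed_path, source_storage):
        # Replace the per-file delete/save handler with a deferred queue.
        with source_storage.open(path) as source_file:
            self._queued.append((prefixed_path, ContentFile(source_file.read())))
        self.log(f"Queued '{path}'", level=2)
```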

Symlinking should probably be moved into the FileSystemSymlinkedStorage class and the --link option removed, rather than having os calls in collectstatic.

Note that Collectfasta also does things with threads and caching hashes in the Django cache, but in my tests these don't provide as much of a performance boost as the bulk fetching does, unless you have some seriously large files or slow hard drives.