Open jasongi opened 7 months ago
I'm experiencing super slow times with collectstatic to S3. A solution would be nice since it costs $.
I will say it was an oversight of mine not to search GH to see how many consumers of the preload metadata API existed before I removed it.
And similarly, you understand why it was removed.
That being said I am happy to work on a solution that improves this for people since it seems there is still significant demand. In general, back when I was working with Django, I had switched to Whitenoise, but I do understand that doesn't work for everyone.
Can you propose a new API that would meet all the needs? It's possible we could upstream it once it was developed out and since this library has BaseStorage
it might be okay to do something general.
I have had a read through the collectstatic and HashedFilesMixin code. I think adding the following methods to the Storage class would help us refactor to improve performance, if we are also able to refactor HashedFilesMixin and the collectstatic to use them (initially this could be done via Collectfasta).
exists_bulk: to be called in HashedFilesMixin._post_process - for S3 can fetch the bucket using listdir, map to a dictionary by filename then look up each key. We can cap this by a configurable amount of files and default back to using HeadObject to avoid major performance issues if it is run on a big bucket.
delete_bulk: to be called in HashedFilesMixin._post_process - for S3 can use the DeleteObjects call which can delete up to 1000 objects
save_changed_files: to be used in collectstatic and HashedFilesMixin._post_process. This should hash the files and remote files and only save the file if the remote files have changed. For S3, we can do this using ETag which is what Collectfasta currently uses, we can also optimally gzip the content before hashing the files if the bucket has gzip enabled, this is also what Collectfasta currently does.
Basic strategy would be to collect a list of all the files that HashedFilesMixin calls self._save, self.exists and self.delete on and defer those calls to the bulk method at the end of post_process so we can do them in bulk.
For collectstatic, we need to collect all files then send them to save_changed_files and let the storage decide if the file needs deletion rather than calling self.delete_file then self.storage.save one at a time. Basically this means deleting the ”handler” methods (link_file and copy_file) and shifting that logic into the new storage class “save_changed_files”.
Symlinking should probably be moved into the FileSystemSymlinkedStorage class and the —link option removed rather than having os calls in collectstatic.
Note that Collectfasta also does things with threads and caching hashes in the Django cache, but I think these don’t provide as much of a performance boost that the bulk fetching does in my tests unless you have some seriously large files or slow hard drives.
A few years ago the AWS_PRELOAD_METADATA setting was removed in this PR: https://github.com/jschneier/django-storages/pull/636
I understand why - this is not something you’d want to turn on when servicing web requests and was probably the source of many complaints. But it was crucial to the performance optimisations that collectfast made. I’ve recently forked collectfast into Collectfasta as collectfast had been archived- but I’ve had to reimplement a lot of the S3Storage methods to add back the preloading functionality in order to maintain the performance improvements. This is not sustainable as the code will need to be kept in sync with each django-storages release.
My proposal: we move this logic back into django-storages (I am happy to contribute) but with some safeguards to ensure it only runs during management commands. Some ideas on this below (not all need to be implemented):
If you want to read about why collectstatic performance is so bad I have a write-up on it here - unfortunately the way that the core Django storage API works - in particular ManifestStaticFilesStorage, it makes it very difficult to do a performant collectstatic command with just the methods available in the Storage interface if your storage class deals with remote storage.