**Closed** · banesullivan closed this issue 2 years ago
This issue is general to all usage of `FileField` and has nothing specific to do with `S3FileField`.
Normally, I hope people aren't using `field_file_to_local_path` very much. A Django `File` is a stream in its own right, and should ideally be sufficient for almost all native-Django operations. The reason `field_file_to_local_path` exists is to support third-party libraries that insist on having a path on disk. I don't think it should be caching anything.
However, I think a natural place to implement file caching is at the `Storage` layer. Since all file access flows through a `Storage`, it is best positioned to determine whether it can serve a file from a cache instead. Of course, django-girder-utils would be a fine place to host this sort of `Storage` subclass too.
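To make the suggestion concrete, here is a minimal sketch of caching at the storage boundary. This is an illustration, not a real Django `Storage`: a production version would subclass `django.core.files.storage.Storage` and delegate to a wrapped remote storage, but here `remote` is just any object with an `open(name)` method so the sketch stays self-contained. The class and method names are assumptions, not an existing API.

```python
import os
import shutil
import tempfile


class CachedStorageSketch:
    """Hedged sketch of file caching at the Storage layer.

    `remote` is any object exposing .open(name) that returns a binary
    file-like object (standing in for a remote storage such as S3).
    """

    def __init__(self, remote, cache_dir=None):
        self.remote = remote
        self.cache_dir = cache_dir or tempfile.mkdtemp(prefix='storage-cache-')

    def _cache_path(self, name):
        # Flatten the storage name into a single cache filename.
        return os.path.join(self.cache_dir, name.replace('/', '__'))

    def open(self, name, mode='rb'):
        cached = self._cache_path(name)
        if not os.path.exists(cached):
            # Cache miss: pull the file through the remote storage once.
            with self.remote.open(name) as src, open(cached, 'wb') as dst:
                shutil.copyfileobj(src, dst)
        # Cache hit (or freshly populated): serve from local disk.
        return open(cached, mode)
```

Because every consumer already goes through `Storage.open`, a wrapper like this transparently deduplicates downloads without any caller changing code.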
As a side note, I personally think that a design which requires thread-locking or serving binary files from Django (instead of from S3 or a microservice) may be a 12-factor anti-pattern and somewhat at odds with how Django is best used.
Even more generally, HTTP endpoints should probably avoid making their own synchronous HTTP calls, especially when binary file transfer is involved, as it's too slow and too unreliable; I can think of some judicious exceptions to this, however.
Regardless, I agree that file caching in itself can be a helpful thing. Celery tasks may particularly benefit (though data locality can't be assured).
The more I think about this, the more data locality looks like a major barrier to making caching horizontally scalable. You've got to assume that a given HTTP request or Celery task will run in a stateless environment (since restarts can happen at any time) and that it will be load-balanced to an arbitrary Heroku dyno or worker node.
Caching is really difficult, so implementing it is probably only worthwhile if you already need to operate at a large scale. So, outside of a vertically scaled environment with very particular data sizes and access patterns (which RGD may well fit into), I'm not sure how generally useful file caching would be.
Generally for processing with data locality, I think Dask might be a good choice. For HTTP requests, re-architecting your application to not depend on synchronous file processing in a request may yield better results.
> This issue is general to all usage of `FileField` and has nothing specific to do with `S3FileField`.
Well, it kind of does have something specific to do with S3FF... For a plain old `FileField`, the file is always at the same, predictable path, so caching is a non-issue:
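(The snippet originally quoted here isn't reproduced; roughly, the `FileSystemStorage` case looks like this. A hedged sketch, not the actual s3-file-field source, and the function name is mine:)

```python
from pathlib import Path


def local_path_for_filesystem_storage(field_file):
    # A FieldFile backed by FileSystemStorage already exposes a stable
    # .path on local disk, so the "download" is free and the same path
    # comes back on every call -- nothing to cache.
    return Path(field_file.path)
```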
But for any other `Storage` subclass (such as S3FF's), this utility creates a new temporary file on each call:
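(Again, the originally quoted snippet isn't reproduced; a hedged sketch of the problematic behavior, with a made-up function name:)

```python
import tempfile
from pathlib import Path


def local_path_for_remote_storage(field_file):
    # Every call streams the remote file into a brand-new
    # NamedTemporaryFile, so two callers never share a path and
    # nothing is ever reused.
    tmp = tempfile.NamedTemporaryFile(delete=False)
    tmp.write(field_file.read())
    tmp.close()
    return Path(tmp.name)
```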
Or am I misunderstanding something here?
> Normally, I hope people aren't using `field_file_to_local_path` very much.
Oh yeah, absolutely. However, we need to use it quite a bit in RGD until we can implement a more robust FUSE interface for accessing S3FF resources.
> A Django `File` is a stream in its own right, and should ideally be sufficient for almost all native-Django operations. The reason it exists is to support third-party libraries which can insist on having a path on disk. I don't think it should be caching anything.
Yep, our problem is that we need to use third-party C++ libraries like GDAL, which expect a named file on disk, not a stream. Why don't you think these should be cached?
Further, we use packages like large_image which do their own caching and, AFAIK, assume the file to be persistent; so the fact that `field_file_to_local_path` immediately removes the temp file is problematic on its own in our use case.
> Even more generally, HTTP endpoints should probably try to avoid running their own synchronous HTTP calls, especially if it involves binary file transfer, as it's too slow and too unreliable; I can think of some judicious exceptions to this, however.
I can't think of any clean way to work around this when it comes to image tile serving... so perhaps RGD's use case is one of those judicious exceptions?
`field_file_to_local_path` currently downloads an `S3FileField` to a new temporary file every time it is called. This is highly problematic if you have different processes all accessing the same file field at the same time (e.g. tile serving, cc @manthey). We had to override this downstream in RGD to support caching and thread locking. To do so, I implemented a mechanism on the host model of the `FileField` such that it has a predetermined path for local downloading; this path is then used with filelock to create a lock on that file resource, so that `FieldFile`s are cached and only downloaded once across threads/processes. See https://github.com/ResonantGeoData/ResonantGeoData/pull/548 for the implementation.
Considering that this function originated from RGD, would it be of interest to handle the caching and resource blocking here and perhaps come up with a more robust solution?