kitware-resonant / django-resonant-utils

Django utilities for data management applications.
Apache License 2.0

Consider cached field_file_to_local_path #36

Closed · banesullivan closed this 2 years ago

banesullivan commented 2 years ago

field_file_to_local_path currently downloads the contents of an S3FileField to a new temporary file every time it is called. This is highly problematic when different processes all access the same file field at the same time (e.g. tile serving, cc @manthey).

We had to override this downstream in RGD to support caching and thread locking. To do so, I implemented a mechanism on the host model of the FileField so that it has a predetermined path for local downloads; that path is then used with filelock to place a lock on the file resource, so field files are cached and downloaded only once across threads/processes.

See https://github.com/ResonantGeoData/ResonantGeoData/pull/548 for the implementation.
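
For illustration, a minimal sketch of that kind of mechanism (the cache location and helper name are hypothetical, not the actual RGD code; see the PR above for the real implementation):

```python
# Sketch only: cache a FieldFile at a predetermined local path, guarded by
# filelock so that concurrent threads/processes download it at most once.
import shutil
from pathlib import Path

from django.db.models.fields.files import FieldFile
from filelock import FileLock

CACHE_DIR = Path("/tmp/fieldfile-cache")  # illustrative location


def cached_local_path(field_file: FieldFile) -> Path:
    # Predetermined path derived from the stored file name.
    local_path = CACHE_DIR / field_file.name
    local_path.parent.mkdir(parents=True, exist_ok=True)
    lock = FileLock(str(local_path) + ".lock")
    with lock:
        if not local_path.exists():
            # Only the first caller (across threads/processes) downloads.
            with field_file.open("rb") as src, open(local_path, "wb") as dest:
                shutil.copyfileobj(src, dest)
    return local_path
```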

Considering that this function originated in RGD, would it be of interest to handle the caching and resource locking here, and perhaps come up with a more robust solution?

brianhelba commented 2 years ago

This issue is general to all usage of FileField and has nothing specific to do with S3FileField.

Normally, I hope people aren't using field_file_to_local_path very much. A Django File is a stream in its own right, and should ideally be sufficient for almost all native-Django operations. The reason field_file_to_local_path exists is to support third-party libraries that insist on having a path on disk. I don't think it should be caching anything.

However, I think a natural place to implement file caching is at the Storage layer. Since all file access flows through a Storage, it would be best positioned to determine whether it can instead provide a file from a cache. Of course, django-girder-utils would be a fine place to host this sort of Storage subclass too.
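
A rough sketch of what such a wrapper could look like (the class name and caching policy are hypothetical, not an existing django-girder-utils API; locking, eviction, and invalidation are omitted):

```python
# Hypothetical sketch of caching at the Storage layer: a wrapper Storage that
# serves reads from a local cache directory, falling back to the wrapped
# (e.g. S3) storage on a miss.
import shutil
from pathlib import Path

from django.core.files import File
from django.core.files.storage import Storage


class LocalCachingStorage(Storage):
    def __init__(self, wrapped: Storage, cache_dir: str = "/tmp/storage-cache"):
        self.wrapped = wrapped
        self.cache_dir = Path(cache_dir)

    def _open(self, name: str, mode: str = "rb") -> File:
        cached = self.cache_dir / name
        if not cached.exists():
            cached.parent.mkdir(parents=True, exist_ok=True)
            # Cache miss: copy the file down from the wrapped storage once.
            with self.wrapped.open(name, "rb") as src, open(cached, "wb") as dest:
                shutil.copyfileobj(src, dest)
        return File(open(cached, mode), name=name)

    def _save(self, name: str, content: File) -> str:
        # Writes pass straight through to the wrapped storage.
        return self.wrapped.save(name, content)

    def exists(self, name: str) -> bool:
        return self.wrapped.exists(name)
```

A FileField's storage argument could then point at such a wrapper, so every read goes through the cache without touching field_file_to_local_path at all.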

brianhelba commented 2 years ago

As a side note, I personally think that a design which requires thread-locking or serving binary files from Django (instead of from S3 or a microservice) may be a 12-factor anti-pattern and somewhat at odds with how Django is best used.

Even more generally, HTTP endpoints should probably try to avoid running their own synchronous HTTP calls, especially if it involves binary file transfer, as it's too slow and too unreliable; I can think of some judicious exceptions to this, however.

Regardless, I agree that file caching in itself can be a helpful thing. Celery tasks may particularly benefit (though data locality can't be assured).

brianhelba commented 2 years ago

The more I think about this, the more I think data locality is going to be a major barrier to making caching horizontally scalable. You have to assume that a given HTTP request or Celery task will run in a stateless environment (since restarts can happen at any time) and that it will be load-balanced to an arbitrary Heroku dyno or worker node.

Caching is really difficult, so for it to be worth implementing, you should probably already need to operate at a large scale. So, outside of a vertically scaled environment with very particular data sizes and access patterns (which RGD may well fit), I'm not sure how generally useful file caching would be.

Generally for processing with data locality, I think Dask might be a good choice. For HTTP requests, re-architecting your application to not depend on synchronous file processing in a request may yield better results.
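
As a sketch of what that could look like (bucket, path, and column names are hypothetical, and the s3fs package is assumed to be installed so Dask can read from S3 directly):

```python
# Minimal sketch: Dask workers pull the partitions they process directly from
# S3, so no Django process has to download and cache whole files locally.
import dask.dataframe as dd

df = dd.read_csv("s3://my-bucket/observations/*.csv")  # requires s3fs
print(df["value"].mean().compute())
```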

banesullivan commented 2 years ago

This issue is general to all usage of FileField and has nothing specific to do with S3FileField.

Well, it kind of does have something specific to do with S3FF... For a plain old FileField, the file is always at the same, predictable path, so caching is a non-issue:

https://github.com/girder/django-girder-utils/blob/587919e4c7b48b67624be40df3389d02380c69bb/girder_utils/files.py#L30-L33

But for any other storage subclass (such as S3FF), this utility creates a new temporary file on each call:

https://github.com/girder/django-girder-utils/blob/587919e4c7b48b67624be40df3389d02380c69bb/girder_utils/files.py#L35-L38

Or am I misunderstanding something here?
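
(For reference, a rough paraphrase of the contrast the two linked snippets show; this is not the actual source, which is at the permalinks above:)

```python
# Paraphrase of the behavior under discussion: with FileSystemStorage the file
# already has a stable path on disk, while any other storage gets copied into
# a fresh NamedTemporaryFile on every call and removed when the context exits.
import contextlib
import shutil
import tempfile
from pathlib import Path

from django.core.files.storage import FileSystemStorage
from django.db.models.fields.files import FieldFile


@contextlib.contextmanager
def field_file_to_local_path(field_file: FieldFile):
    if isinstance(field_file.storage, FileSystemStorage):
        # Same, predictable path every time; nothing to cache.
        yield Path(field_file.path)
    else:
        # New temporary file per call, deleted as soon as the context exits.
        with tempfile.NamedTemporaryFile() as tmp:
            with field_file.open("rb") as src:
                shutil.copyfileobj(src, tmp)
            tmp.flush()
            yield Path(tmp.name)
```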


Normally, I hope people aren't using field_file_to_local_path very much.

Oh yeah, absolutely. However, we need to use it quite a bit in RGD until we can implement a more robust FUSE interface for accessing S3FF resources.


A Django File is a stream in its own right, and should ideally be sufficient for almost all native-Django operations. The reason it exists is to support third-party libraries which can insist on having a path on disk. I don't think it should be caching anything.

Yep, our problem is that we need to use third-party C++ libraries like GDAL, which expect a named file on disk, not a stream. Why don't you think these should be cached?

Further, we use packages like large_image that do their own caching and, AFAIK, assume the file is persistent, so the fact that field_file_to_local_path immediately removes the temp file is problematic on its own for our use case.
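
To illustrate the constraint (the local path below is hypothetical, e.g. produced by some caching helper):

```python
# GDAL's Python bindings open datasets by filename, not from a Python stream,
# which is why a real, persistent path on disk is needed.
from osgeo import gdal

local_path = "/tmp/fieldfile-cache/rasters/scene.tif"  # hypothetical cached file
dataset = gdal.Open(local_path)  # gdal.Open() takes a path, not a file object
if dataset is not None:
    print(dataset.RasterXSize, dataset.RasterYSize, dataset.RasterCount)
```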


Even more generally, HTTP endpoints should probably try to avoid running their own synchronous HTTP calls, especially if it involves binary file transfer, as it's too slow and too unreliable; I can think of some judicious exceptions to this, however.

I can't think of any clean way to work around this when it comes to image tile serving... so perhaps RGD's use case is one of those judicious exceptions?