Open hsuominen opened 3 months ago
Can you elaborate on the use case please? What type of format are you thinking of?
Appreciate the quick response. Basically I'm hoping that rosettasciio would support a similar interface to e.g. pandas or imageio:
https://imageio.readthedocs.io/en/stable/_autosummary/imageio.v3.imread.html#imageio.v3.imread https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv
This would enable smoother use in distributed applications, where the process that actually loads the file has no access to the filesystem on which the file is stored and instead just receives a file-like object. Copied from the pandas docs:
By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.
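To make the request concrete, here is a minimal sketch of a reader that accepts either a path or a file-like object. The helper names `open_binary` and `read_magic` are hypothetical and are not part of rosettasciio, pandas, or imageio; only the standard library is used.

```python
import io
import os
from contextlib import contextmanager


@contextmanager
def open_binary(source):
    """Yield a binary handle for a path or an already-open file-like object.

    Hypothetical helper for illustration only.
    """
    if isinstance(source, (str, os.PathLike)):
        # A path was given: open it here and close it when done.
        with open(source, "rb") as handle:
            yield handle
    elif hasattr(source, "read"):
        # A file-like object was given: the caller owns it, so leave it open.
        yield source
    else:
        raise TypeError(f"Expected a path or file-like object, got {type(source)!r}")


def read_magic(source, n=4):
    """Return the first n bytes of the source (hypothetical example reader)."""
    with open_binary(source) as handle:
        return handle.read(n)


# Usage: the same function works on in-memory bytes, with no filesystem involved.
buffer = io.BytesIO(b"\x89PNG\r\n\x1a\n rest of the file")
print(read_magic(buffer))
```

A reader written this way only needs `read()` (and, for formats with headers or offsets, usually `seek()`), so bytes fetched from object storage or a message queue can be wrapped in `io.BytesIO` and passed straight in.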
Off the top of my head, there may already be a few formats that can do that, but I suspect that rosettasciio supports a wider variety of file types than imageio and pandas, and the behaviour may differ depending on the type.
Here is a list of the different types of files
There should be some low-hanging fruit, as it should be easy to implement for some types.
@hsuominen is the idea that you are loading data that isn't on the computer doing the operation?
I think zarr might be a good place to start. https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.LRUStoreCache
This store implementation uses an LRU cache over an S3 bucket, which might be interesting if AWS is hosting the data.
@hsuominen is the idea that you are loading data that isn't on the computer doing the operation?
yes that's right.
Our intent is to get the data out of proprietary formats and into e.g. zarr (which looks great), but we need to run this extraction on compute that doesn't have the files sitting locally. There are fairly easy workarounds (e.g. using a temporary file), but I thought it would be good to get this discussion going, as I can see others eventually running into similar needs.
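For reference, the temporary-file workaround mentioned above could look roughly like the sketch below. `read_via_tempfile` and `size_reader` are hypothetical names used for illustration; in practice `path_reader` would be whatever path-only reader is in use (for example a plugin's `file_reader` function, which is an assumption about how it would be wired up).

```python
import tempfile
from pathlib import Path


def read_via_tempfile(data, path_reader, suffix=""):
    """Write in-memory bytes to a temporary file and pass its path to a reader.

    Hypothetical helper: `path_reader` is any callable that only accepts a
    filename. The suffix matters for readers that dispatch on file extension.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        tmp_path = Path(tmpdir) / f"payload{suffix}"
        tmp_path.write_bytes(data)
        # The file is deleted when the directory context exits, so the reader
        # must load the data eagerly rather than keep the file open lazily.
        return path_reader(str(tmp_path))


def size_reader(filename):
    """Stand-in for a real path-only reader; returns the file size in bytes."""
    return Path(filename).stat().st_size


# Usage: bytes received over the network are round-tripped through local disk.
print(read_via_tempfile(b"abcdef", size_reader, suffix=".dm3"))
```

This works, but it costs an extra write and read on local disk and breaks for lazy or memory-mapped loading, which is part of why direct support for file-like objects would be preferable.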
Looking specifically at some of the file formats we are interested in, the changes needed in some cases would be pretty trivial (as @ericpre hinted): https://github.com/hyperspy/rosettasciio/blob/e49911047c03465a84facac546053576a23ef915/rsciio/digitalmicrograph/_api.py#L1278-L1279
but likely harder in others: https://github.com/hyperspy/rosettasciio/blob/e49911047c03465a84facac546053576a23ef915/rsciio/emd/_api.py#L171-L173
Describe the functionality you would like to see.
For a number of applications it would be preferable if file reading supported file-like objects as well as strings or paths.