Support str, path object or file-like object on file read

hyperspy / rosettasciio

Python library for reading and writing scientific data format

https://hyperspy.org/rosettasciio

GNU General Public License v3.0


Support str, path object or file-like object on file read #302

Open hsuominen opened 3 months ago

hsuominen commented 3 months ago

Describe the functionality you would like to see.

For a number of applications it would be preferable if file reading supported file-like objects as well as strings or paths.

ericpre commented 3 months ago

Can you elaborate on the use case please? What type of format are you thinking of?

hsuominen commented 3 months ago

Appreciate the quick response. Basically I'm hoping that rosettasciio would support a similar interface to e.g. pandas or imageio:

https://imageio.readthedocs.io/en/stable/_autosummary/imageio.v3.imread.html#imageio.v3.imread

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv

This would enable smoother use in distributed applications, where the process that actually loads the file has no access to the filesystem on which the file is stored and instead receives the data as a file-like object. The pandas docs define the term as follows:

By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.
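To make the request concrete, here is a minimal sketch of how a reader could accept all three input types. The helper name `open_binary` is hypothetical and not part of rosettasciio; it only illustrates the pandas/imageio-style convention of treating anything with a `read()` method as file-like.

```python
import io
import os
from contextlib import contextmanager


@contextmanager
def open_binary(source):
    """Yield a readable binary file object for a str, path object, or file-like object.

    Hypothetical helper for illustration only; not an existing rosettasciio function.
    """
    if isinstance(source, (str, os.PathLike)):
        # We open the file ourselves, so we also close it on exit.
        with open(source, "rb") as f:
            yield f
    elif hasattr(source, "read"):
        # The caller owns the object, so it is left open on exit.
        yield source
    else:
        raise TypeError(
            f"Expected str, path object or file-like object, got {type(source).__name__}"
        )


# Usage: an in-memory buffer is handled the same way as a path on disk.
buffer = io.BytesIO(b"\x00\x01\x02\x03")
with open_binary(buffer) as f:
    header = f.read(2)
```

A reader written against `open_binary` never needs to know whether the bytes came from a local disk or from a remote source.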

ericpre commented 3 months ago

Off the top of my head, a few formats may already be able to do that, but I suspect that rosettasciio supports a wider variety of file types than imageio and pandas, and the behaviour may differ depending on the type.

Here is a list of the different types of files:

binary file
text file
h5py file - see caveats in https://github.com/h5py/h5py/issues/1698
numpy file
zarr file; maybe some zarr stores will work
multiple files, typically a binary file and a text file; for example the ripple format, where the metadata are in a separate file

There should be some low-hanging fruit, as this should be easy to implement for some of these types.

CSSFrancis commented 3 months ago

@hsuominen is the idea that you are loading data that isn't on the computer doing the operation?

I think zarr might be a good place to start. https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.LRUStoreCache

This store implementation puts an LRU cache over another store, such as an S3 bucket, which might be interesting if AWS is hosting the data.

hsuominen commented 3 months ago

> @hsuominen is the idea that you are loading data that isn't on the computer doing the operation?

Yes, that's right.

Our intent is to get the data out of proprietary formats and into e.g. zarr (which looks great), but we need to run this extraction on compute that doesn't have the files sitting locally. There are fairly easy workarounds (e.g. using a TempFile), but I thought it would be good to get this discussion going, as I can see others eventually running into similar needs.
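For reference, the temporary-file workaround could look roughly like the sketch below. The helper name `as_temporary_path` is hypothetical; it copies a file-like object to disk so that any reader that only accepts a filename can be used unchanged. The `suffix` argument is included on the assumption that a reader may choose its format from the file extension.

```python
import io
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def as_temporary_path(fileobj, suffix=""):
    """Copy a binary file-like object to a temporary file and yield its path.

    Hypothetical helper for illustration only; not an existing rosettasciio function.
    """
    # delete=False so the file can be closed and then reopened by name,
    # which is required on Windows.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        with tmp:
            shutil.copyfileobj(fileobj, tmp)
        # The file is closed (and flushed) here; only the path is handed out.
        yield tmp.name
    finally:
        os.remove(tmp.name)


# Usage: the path can be passed to any function that expects a filename.
remote_bytes = io.BytesIO(b"example payload")
with as_temporary_path(remote_bytes, suffix=".dm3") as path:
    with open(path, "rb") as f:
        content = f.read()
```

The cost is one extra copy of the data to local disk, which is why native support for file-like objects would be preferable for large files.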

Looking specifically at some of the file formats we are interested in, the changes needed in some cases would be pretty trivial (as @ericpre hinted): https://github.com/hyperspy/rosettasciio/blob/e49911047c03465a84facac546053576a23ef915/rsciio/digitalmicrograph/_api.py#L1278-L1279

but likely harder in others: https://github.com/hyperspy/rosettasciio/blob/e49911047c03465a84facac546053576a23ef915/rsciio/emd/_api.py#L171-L173