magland / remfile

File-like object from url of remote file, optimized for use with h5py.
Apache License 2.0

Is it thread safe, in particular with caching? #7

Open yarikoptic opened 1 year ago

yarikoptic commented 1 year ago

Sounds like a great replacement for our dandisets-healthchecks goal (underneath, datalad-fuse is using fsspec with caching). One of the problems we had with fsspec was having to lock everything up, in effect making it all single-threaded...

Another question: how specific is it to h5py? Or is it specific to the access patterns of HDF5 files?

magland commented 1 year ago

@yarikoptic

Without caching it is thread-safe, but the caching is not thread-safe. I could make it thread-safe using filelock if that is a priority.
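As a rough illustration of what a thread-safe chunk cache could look like, here is a minimal sketch. It is not remfile's actual cache: `ChunkCache`, `get_or_fetch`, and the `fetch` callback are hypothetical names, and it uses the standard library's `threading.Lock` (in-process only) in place of filelock, which would be the cross-process analogue. Writes go to a temporary file and are moved into place with `os.replace`, so a reader never sees a half-written chunk.

```python
import hashlib
import os
import tempfile
import threading


class ChunkCache:
    """Hypothetical on-disk chunk cache (not remfile's real implementation)."""

    def __init__(self, directory):
        self._dir = directory
        os.makedirs(directory, exist_ok=True)
        self._locks = {}                      # one lock per cached chunk path
        self._locks_guard = threading.Lock()  # protects the dict of locks

    def _path(self, url, index):
        # One cache file per (url, chunk index) pair.
        digest = hashlib.sha1(f"{url}#{index}".encode()).hexdigest()
        return os.path.join(self._dir, digest)

    def _lock_for(self, key):
        with self._locks_guard:
            return self._locks.setdefault(key, threading.Lock())

    def get_or_fetch(self, url, index, fetch):
        """Return the chunk, calling fetch(url, index) at most once per chunk."""
        path = self._path(url, index)
        # Locking per chunk means different URLs never block each other.
        with self._lock_for(path):
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return f.read()
            data = fetch(url, index)
            # Write to a temp file, then rename atomically into place.
            fd, tmp = tempfile.mkstemp(dir=self._dir)
            with os.fdopen(fd, "wb") as f:
                f.write(data)
            os.replace(tmp, path)
            return data
```

Because the lock is keyed by chunk rather than shared globally, threads reading different files proceed in parallel, and only threads asking for the same chunk wait on one another.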

Strictly speaking, it is not specific to h5py, but I developed it for that purpose, and it has only been tested with h5py. It simply makes HTTP range requests to get chunks of data. The trick is to decide how large to make those chunks based on the read requests coming in from the calling routine. It knows to use larger chunks if it is receiving sequential file read requests. (Not sure if that makes sense.)
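To make the chunk-sizing idea concrete, here is a minimal sketch under stated assumptions. It is not remfile's actual algorithm: the class name `AdaptiveReader`, the doubling rule, and the `min_chunk`/`max_chunk` values are all made up for illustration. The `fetch_range(start, end)` callback stands in for an HTTP range request (in real use it would send a `Range: bytes=start-(end-1)` header).

```python
class AdaptiveReader:
    """Hypothetical reader that grows its fetch size on sequential reads."""

    def __init__(self, fetch_range, size, min_chunk=1024, max_chunk=64 * 1024):
        self._fetch = fetch_range    # callable(start, end) -> bytes, end exclusive
        self._size = size            # total size of the remote file
        self._min = min_chunk
        self._max = max_chunk
        self._chunk = min_chunk      # current fetch size
        self._buf_start = 0
        self._buf = b""
        self._next_expected = None   # offset a sequential reader would ask for next
        self.requests = []           # (start, end) of each fetch, for inspection

    def read(self, offset, length):
        end = min(offset + length, self._size)
        if offset >= end:
            return b""
        buf_end = self._buf_start + len(self._buf)
        if self._buf_start <= offset and end <= buf_end:
            # Already buffered: no network request needed.
            self._next_expected = end
            i = offset - self._buf_start
            return self._buf[i:i + (end - offset)]
        if offset == self._next_expected:
            # Sequential access: double the chunk size, up to the cap.
            self._chunk = min(self._chunk * 2, self._max)
        else:
            # Random access: fall back to small chunks.
            self._chunk = self._min
        fetch_end = min(max(end, offset + self._chunk), self._size)
        self._buf = self._fetch(offset, fetch_end)
        self._buf_start = offset
        self.requests.append((offset, fetch_end))
        self._next_expected = end
        return self._buf[:end - offset]
```

With an in-memory stand-in such as `fetch_range = lambda s, e: data[s:e]`, a run of back-to-back small reads triggers progressively larger fetches, while a jump to an unrelated offset drops back to `min_chunk`.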

CodyCBakerPhD commented 1 year ago

One of the problems we had with fsspec was having to lock everything up, in effect making it all single-threaded...

Did you ever try the resolution to this indicated in this thread? https://github.com/fsspec/filesystem_spec/issues/1298#issuecomment-1609845834

I will point out that if dandisets-healthchecks does not actually leverage caching of streamed content (i.e., re-accessing subslices within chunks previously requested) then remfile will indeed be preferred for your use

But fsspec still has a much larger variety of sophisticated caching mechanisms: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/caching.py

And while remfile has at least the most basic of these: https://github.com/flatironinstitute/neurosift/issues/100#issuecomment-1670088570

I don't think it's necessarily worth it to reinvent the wheel for all those other cache types. So I view remfile as the greatly preferred alternative to the ros3 driver for streaming raw bytes, and fsspec as the hook for achieving partial downloads of a file that you often want to re-access multiple times with higher efficiency.

magland commented 1 year ago

Another note about remfile caching. For now the system caches MANY relatively small files, which can be time-consuming for cleaning up, or sub-optimal depending on the file system, if downloading a very large file.
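For a sense of scale, here is a back-of-the-envelope sketch of how many cache files a fully downloaded file would produce if each chunk is stored as its own file. The function name and the numbers (a 50 GiB file, 100 KiB chunks) are assumptions picked for illustration, not remfile's actual defaults.

```python
def cache_file_count(file_size, chunk_size):
    """Number of per-chunk cache files needed to hold a whole file."""
    # Integer ceiling division: a partial final chunk still needs its own file.
    return (file_size + chunk_size - 1) // chunk_size


# Assumed figures, for illustration only.
file_size = 50 * 1024**3   # 50 GiB
chunk_size = 100 * 1024    # 100 KiB
print(cache_file_count(file_size, chunk_size))  # 524288
```

Hundreds of thousands of small files in one cache directory is the kind of layout where cleanup time and file system behavior start to matter.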

yarikoptic commented 1 year ago

I will point out that if dandisets-healthchecks does not actually leverage caching of streamed content (i.e., re-accessing subslices within chunks previously requested) then remfile will indeed be preferred for your use

We use fsspec's caching mechanism at the datalad-fuse level (https://github.com/datalad/datalad-fuse/blob/master/datalad_fuse/fsspec.py#L44) so that we could also have efficient access to those .nwb files from MATLAB or anything else if the use case comes up. But ATM we use really crude locking to make it all work since, IIRC, the fsspec index spans all URLs (it is not per URL), so parallel access across .nwb files is pretty much "serial".

For now the system caches MANY relatively small files, which can be time-consuming for cleaning up, or sub-optimal depending on the file system, if downloading a very large file.

Many small files for a single large .nwb? (I haven't tried it or looked inside yet.)