Open yarikoptic opened 1 year ago
@yarikoptic
Without caching it is thread-safe, but the caching is not thread-safe. I could make it thread-safe using filelock if that is a priority.
Strictly speaking, it is not specific to h5py, but I developed it for that purpose, and it has only been tested with h5py. It simply makes HTTP range requests to get chunks of data. The trick is to decide how large to make those chunks based on the read requests coming from the calling routine: it uses larger chunks when it sees sequential file read requests. (Not sure if that makes sense.)
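To make the chunk-sizing idea concrete, here is a minimal sketch, not remfile's actual implementation. It assumes a simple doubling policy: the chunk size grows while reads are sequential and resets on a random seek. The names `AdaptiveChunkReader`, `http_range_fetcher`, `min_chunk`, and `max_chunk` are hypothetical and chosen only for illustration.

```python
import urllib.request
from typing import Callable, List, Optional


def http_range_fetcher(url: str) -> Callable[[int, int], bytes]:
    """Return a function fetching bytes [start, end) of `url` with an HTTP Range request."""
    def fetch(start: int, end: int) -> bytes:
        # The Range header uses an inclusive end offset, hence end - 1.
        req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end - 1}"})
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    return fetch


class AdaptiveChunkReader:
    """Hypothetical reader that requests larger chunks while reads stay sequential."""

    def __init__(self, fetch: Callable[[int, int], bytes], size: int,
                 min_chunk: int = 1024, max_chunk: int = 1024 * 1024) -> None:
        self._fetch = fetch
        self._size = size
        self._min = min_chunk
        self._max = max_chunk
        self._chunk = min_chunk
        self._buf = b""
        self._buf_start = 0
        self._last_end: Optional[int] = None
        self.fetch_sizes: List[int] = []  # recorded only to make the behaviour visible

    def read(self, offset: int, length: int) -> bytes:
        end = min(offset + length, self._size)
        if offset >= end:
            return b""
        # Assumed policy: double the chunk size on a sequential read, reset otherwise.
        if self._last_end is not None and offset == self._last_end:
            self._chunk = min(self._chunk * 2, self._max)
        else:
            self._chunk = self._min
        self._last_end = end
        buf_end = self._buf_start + len(self._buf)
        if not (self._buf_start <= offset and end <= buf_end):
            # Buffer miss: fetch at least the request, padded up to the current chunk size.
            fetch_end = min(max(end, offset + self._chunk), self._size)
            self._buf = self._fetch(offset, fetch_end)
            self._buf_start = offset
            self.fetch_sizes.append(fetch_end - offset)
        rel = offset - self._buf_start
        return self._buf[rel:rel + (end - offset)]
```

In this sketch the reader keeps only one buffered chunk. The fetch function is injected, so the policy can be exercised against in-memory bytes before pointing `http_range_fetcher` at a real URL.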
One of the problems we had with fsspec is having to lock everything up, in effect making it all single-threaded...
Did you ever try the resolution to this indicated in this thread? https://github.com/fsspec/filesystem_spec/issues/1298#issuecomment-1609845834
I will point out that if dandisets-healthchecks does not actually leverage caching of streamed content (i.e., re-accessing subslices within chunks previously requested) then remfile will indeed be preferred for your use.
But fsspec still has a much larger variety of sophisticated caching mechanisms: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/caching.py
And while remfile has at least the most basic of these (https://github.com/flatironinstitute/neurosift/issues/100#issuecomment-1670088570), I don't think it's necessarily worth it to reinvent the wheel for all those other cache types. So I view remfile as the greatly preferred alternative to the ros3 driver for streaming raw bytes, and fsspec as the hook-in for achieving partial downloads of a file you often want to re-access multiple times with higher efficiency.
Another note about remfile caching. For now the system caches MANY relatively small files, which can be time-consuming for cleaning up, or sub-optimal depending on the file system, if downloading a very large file.
I will point out that if dandisets-healthchecks does not actually leverage caching of streamed content (i.e., re-accessing subslices within chunks previously requested) then remfile will indeed be preferred for your use
We use fsspec's caching mechanism at the datalad-fuse level (https://github.com/datalad/datalad-fuse/blob/master/datalad_fuse/fsspec.py#L44) so we could also have efficient access to those .nwb files from MATLAB or anything else if the use case comes up. But ATM we use really crude locking to make it all work since, IIRC, the fsspec index goes across all URLs (not per URL), so parallel access across nwb files is pretty much "serial".
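For contrast with one coarse lock shared by every URL, here is a rough sketch of per-URL locking, where readers of different files do not block each other. It is not how datalad-fuse or fsspec work today; `PerKeyLocks` and `lock_for` are hypothetical names, and it only covers threads within one process.

```python
import threading
from typing import Dict


class PerKeyLocks:
    """Hypothetical registry handing out one lock per key (e.g. per URL)."""

    def __init__(self) -> None:
        self._guard = threading.Lock()  # protects the registry itself
        self._locks: Dict[str, threading.Lock] = {}

    def lock_for(self, key: str) -> threading.Lock:
        # The guard is held only while looking up or creating the lock,
        # so it never serializes the actual cache reads and writes.
        with self._guard:
            lock = self._locks.get(key)
            if lock is None:
                lock = self._locks[key] = threading.Lock()
            return lock
```

A caller would wrap cache access for a given file in `with locks.lock_for(url):`, so two threads working on different .nwb URLs proceed in parallel while access to the same URL stays serialized.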
For now the system caches MANY relatively small files, which can be time-consuming for cleaning up, or sub-optimal depending on the file system, if downloading a very large file.
many small files for a single large .nwb? (didn't try/look inside yet)
Sounds like a great replacement for our dandisets-healthchecks goal (underneath, datalad-fuse is using fsspec with caching). One of the problems we had with fsspec is having to lock everything up, in effect making it all single-threaded...
Another question: how specific is it to h5py? Or is it specific to the access patterns of HDF5 files?