rly / h5tojson

Experimental JSON-based format for HDF5 datasets

Support dataset-level caching into standalone HDF5 files #5

Closed. rly closed this issue 8 months ago.

rly commented 8 months ago

Use case: Occasionally, I want to download part of a very large NWB or other HDF5 file. Such a file contains multiple large datasets, and I want 1) the metadata and 2) one of the large HDF5 datasets in its entirety for fast, local analysis. Ideally, I could download that dataset quickly and without overloading RAM, read it offline using h5py, keep its chunking and compression properties, and delete the cached copy when I am done.

Remfile and fsspec support caching, but as far as I know, I still have to connect to the remote file to use the cache. I cannot easily read a cached dataset on its own, and if I have read multiple datasets, it is not clear from the cache directory which cache file(s) correspond to which dataset, because the files are named with hex strings.

One solution is to copy the HDF5 dataset into its own standalone HDF5 file and, within the h5tojson format, store a relative path to that file. Since the standalone HDF5 file is created anew, it could also carry a backlink to the original data for provenance. A sketch of this is below.
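Here is a minimal sketch of what this could look like with h5py. This is not the actual implementation (see #6 for that); the `cache_dataset` helper, the `source_url` / `source_dataset_path` attribute names, and the example URL are all illustrative assumptions. Copying chunk by chunk via `Dataset.iter_chunks()` keeps RAM use bounded by the chunk size while preserving the source dataset's chunking and compression settings:

```python
import h5py


def cache_dataset(src, dataset_path, out_path, source_url=None):
    """Copy one HDF5 dataset into a standalone HDF5 file, preserving
    chunking and compression, one chunk at a time (hypothetical helper)."""
    with h5py.File(src, "r") as fin, h5py.File(out_path, "w") as fout:
        dset = fin[dataset_path]
        out = fout.create_dataset(
            dataset_path,  # keep the same internal path as the source
            shape=dset.shape,
            dtype=dset.dtype,
            chunks=dset.chunks,
            compression=dset.compression,
            compression_opts=dset.compression_opts,
        )
        if dset.chunks is None:
            out[...] = dset[...]  # contiguous dataset: copy in one read
        else:
            # Chunked dataset: copy one chunk at a time to limit RAM use.
            for chunk_slice in dset.iter_chunks():
                out[chunk_slice] = dset[chunk_slice]
        if source_url is not None:
            # Backlink to the original data for provenance
            # (attribute names are assumptions, not part of h5tojson).
            out.attrs["source_url"] = source_url
            out.attrs["source_dataset_path"] = dataset_path
```

Because h5py accepts file-like objects, the source could be opened through remfile so the download itself is streamed, e.g. (hypothetical URL and dataset path):

```python
import remfile

url = "https://example.com/some_file.nwb"  # hypothetical URL
cache_dataset(remfile.File(url), "/acquisition/ts/data", "ts_data.h5",
              source_url=url)
```

The resulting file is plain HDF5, readable offline with h5py alone, and deleting the cache is just deleting the file.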

I'm not sure how useful this feature is, but I thought it would be an interesting experiment in taking an existing NWB file in the cloud and translating it into JSON metadata plus separate HDF5 files, one per dataset.

rly commented 8 months ago

Implemented by #6.