magland opened this issue 4 years ago
This one might even be (more) relevant to https://github.com/dandi/dandiarchive itself at some point
@yarikoptic and all: check out the example: https://github.com/flatironinstitute/h5_to_json/blob/master/examples/example1.py
This shows how to get a memmap'd numpy array from a dataset in the .h5.json file:

```python
ds1 = h5j.get_value(Y['root']['group1']['_datasets']['ds1'], data_dir='test_dir')
```
The next step is to seamlessly retrieve a np-array-like object that lazily loads data chunks over the network in the case where the underlying data is on a remote machine.
The right way to do the remote reading might be to create a file-like Python object that points to the data URL. Since the server supports the Range header for retrieving chunks of data, a solution for this should already exist. The API would need to allow specification of the chunk size and handle caching internally. Looking for such a tool now.
That file-like object would then be passed into np.fromfile(...)
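To make the idea concrete (this is a toy sketch, not any particular library's API): a seekable, read-only file-like object can be built around a `fetch(offset, length)` callable; over HTTP that callable would issue a request with a `Range: bytes=...` header. The `RangeReader` name is invented here, the bytes-backed fetch stands in for the network, and chunk caching is omitted:

```python
import io

class RangeReader(io.RawIOBase):
    """Sketch of a seekable read-only file-like object whose bytes come
    from a fetch(offset, length) callable, e.g. an HTTP Range request."""
    def __init__(self, fetch, size):
        self._fetch = fetch
        self._size = size
        self._pos = 0

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        else:  # io.SEEK_END
            self._pos = self._size + offset
        return self._pos

    def read(self, n=-1):
        if n < 0:
            n = self._size - self._pos
        data = self._fetch(self._pos, n)
        self._pos += len(data)
        return data

    def readable(self):
        return True

    def seekable(self):
        return True

# local bytes stand in for the server; a real fetch would issue a GET
# with "Range: bytes=offset-(offset+length-1)"
blob = bytes(range(100))
f = RangeReader(lambda off, n: blob[off:off + n], size=len(blob))
f.seek(20)
assert f.read(5) == blob[20:25]
```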
I think I may have found it: https://github.com/intake/filesystem_spec
For example:
```python
import fsspec

with fsspec.open('http://132.249.245.245:24342/get/sha1/97e19b527ecef3d5541210718abece6ac03e9f70?channel=neuro&signature=d92fc8320e310e3ed6793d25c8d16d69814aa19d') as f:
    f.seek(20)
    x = f.read(20)
```
I'm pretty sure it only downloads file parts as needed but I will need to investigate further.
It works very well; I can even specify the block_size. But unfortunately np.fromfile() doesn't seem to be cooperating. I'll need to think some more about it.
Even though I have a file-like object to the underlying data on the remote server (permitting lazy loading), I can't find a nice way to feed this to an ndarray-like object. :(
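For what it's worth, np.fromfile expects a real on-disk file, so one workaround is to read the desired byte range from the file-like object manually and decode it with np.frombuffer instead. A minimal sketch, with an in-memory buffer standing in for the remote stream:

```python
import io
import numpy as np

# in-memory stand-in for the remote file-like object
buf = io.BytesIO(np.arange(10, dtype=np.float64).tobytes())

# skip the first two float64 values, then read the next three
buf.seek(2 * 8)
raw = buf.read(3 * 8)

# decode the bytes without ever calling np.fromfile
x = np.frombuffer(raw, dtype=np.float64)  # array([2., 3., 4.])
```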
I decided to write my own lazy-loader array-like object. It doesn't handle all the cases, but it does handle basic 1d and 2d slicing.
Here's the live example: https://colab.research.google.com/drive/13HKWVLZKii_1XTGk1DikhizbJbh7Mggj
And the relevant code:
```python
import kachery as ka
import h5_to_json as h5j
from matplotlib import pyplot as plt

# configure connection to the remote kachery server
ka.set_config(
    url='http://132.249.245.245:24342',
    channel='public',
    password='public',
    download_only=True,
    verbose=True
)

# load the .nwb.json object (created via h5_to_json)
X = ka.load_object('sha1://5bfc5d6b06c1b927d63ddde446a9c3079271f5f5/bon03.nwb.json')

# create the hierarchical version for convenience
Xh = h5j.hierarchy(X)

# make a pointer to the large LFP array
lfp_lazy = h5j.get_value(Xh['root']['acquisition']['LFP']['bonlfp-3']['_datasets']['data'], use_kachery=True, lazy=True)

# retrieve a small amount of data from a few channels and plot it
data = lfp_lazy[0:5, 1000000:1000500]
plt.plot(data.T);
```
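For a sense of what such a lazy loader can look like (a toy sketch, not the actual h5_to_json implementation): an array-like class resolves a pair of slices into per-row seek/read calls on the underlying file-like object, assuming C-order (row-major) layout. The `LazyArray2D` name is invented, and an in-memory buffer stands in for the remote stream:

```python
import io
import numpy as np

class LazyArray2D:
    """Sketch of a lazy 2D array over a seekable file-like object.
    Assumes C-order layout; only handles a [rows, cols] pair of slices."""
    def __init__(self, f, shape, dtype):
        self._f = f
        self.shape = shape
        self.dtype = np.dtype(dtype)

    def __getitem__(self, index):
        rows, cols = index
        r0, r1, _ = rows.indices(self.shape[0])
        c0, c1, _ = cols.indices(self.shape[1])
        itemsize = self.dtype.itemsize
        out = np.empty((r1 - r0, c1 - c0), dtype=self.dtype)
        for i, r in enumerate(range(r0, r1)):
            # seek to the start of the requested columns within row r
            self._f.seek((r * self.shape[1] + c0) * itemsize)
            raw = self._f.read((c1 - c0) * itemsize)
            out[i] = np.frombuffer(raw, dtype=self.dtype)
        return out

# usage, with an in-memory buffer standing in for the remote file-like object
a = np.arange(20, dtype=np.int64).reshape(4, 5)
lazy = LazyArray2D(io.BytesIO(a.tobytes()), shape=(4, 5), dtype=np.int64)
assert np.array_equal(lazy[1:3, 2:5], a[1:3, 2:5])
```

Only the requested bytes are ever read, which is what makes remote slicing cheap relative to downloading the whole dataset.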
See https://github.com/flatironinstitute/h5_to_json where I put a preliminary README and am adding installation and usage instructions now. You can take a look at the bullet points there.