magland opened this issue 4 years ago
This one might even be (more) relevant to https://github.com/dandi/dandiarchive itself at some point
@yarikoptic and all: check out the example: https://github.com/flatironinstitute/h5_to_json/blob/master/examples/example1.py
This shows how to get a memmap'd numpy array from a dataset in the .h5.json file:

```python
ds1 = h5j.get_value(Y['root']['group1']['_datasets']['ds1'], data_dir='test_dir')
```
The next step is to seamlessly retrieve a np-array-like object that lazily loads data chunks over the network in the case where the underlying data is on a remote machine.
The right way to do the remote reading might be to create a file-like Python object that points to the data URL. Since the server supports the Range header for retrieving chunks of data, a solution for this should already exist. The API would need to allow specification of the chunk size and handle caching internally. Looking for such a tool now.
That file-like object would then be passed into np.fromfile(...)
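To make the idea concrete (this is a toy sketch, not any particular library's API): a seekable, read-only file-like object can be built around a `fetch(offset, length)` callable; over HTTP that callable would issue a request with a `Range: bytes=...` header. The `RangeReader` name is invented here, the bytes-backed fetch stands in for the network, and chunk caching is omitted:

```python
import io

class RangeReader(io.RawIOBase):
    """Sketch of a seekable read-only file-like object whose bytes come
    from a fetch(offset, length) callable, e.g. an HTTP Range request."""
    def __init__(self, fetch, size):
        self._fetch = fetch
        self._size = size
        self._pos = 0

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        else:  # io.SEEK_END
            self._pos = self._size + offset
        return self._pos

    def read(self, n=-1):
        if n < 0:
            n = self._size - self._pos
        data = self._fetch(self._pos, n)
        self._pos += len(data)
        return data

    def readable(self):
        return True

    def seekable(self):
        return True

# local bytes stand in for the server; a real fetch would issue a GET
# with "Range: bytes=offset-(offset+length-1)"
blob = bytes(range(100))
f = RangeReader(lambda off, n: blob[off:off + n], size=len(blob))
f.seek(20)
assert f.read(5) == blob[20:25]
```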
I think I may have found it: https://github.com/intake/filesystem_spec
For example:
```python
import fsspec

with fsspec.open('http://132.249.245.245:24342/get/sha1/97e19b527ecef3d5541210718abece6ac03e9f70?channel=neuro&signature=d92fc8320e310e3ed6793d25c8d16d69814aa19d') as f:
    f.seek(20)
    x = f.read(20)
```
I'm pretty sure it only downloads file parts as needed but I will need to investigate further.
It works very well; I can even specify the block_size. But unfortunately np.fromfile() doesn't seem to be cooperating. I'll need to think some more about it.
Even though I have a file-like object to the underlying data on the remote server (permitting lazy loading), I can't find a nice way to feed this to an ndarray-like object. :(
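For what it's worth, np.fromfile expects a real on-disk file, so one workaround is to read the desired byte range from the file-like object manually and decode it with np.frombuffer instead. A minimal sketch, with an in-memory buffer standing in for the remote stream:

```python
import io
import numpy as np

# in-memory stand-in for the remote file-like object
buf = io.BytesIO(np.arange(10, dtype=np.float64).tobytes())

# skip the first two float64 values, then read the next three
buf.seek(2 * 8)
raw = buf.read(3 * 8)

# decode the bytes without ever calling np.fromfile
x = np.frombuffer(raw, dtype=np.float64)  # array([2., 3., 4.])
```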
I decided to write my own lazy-loader array-like object. It doesn't handle all the cases, but it does handle basic 1d and 2d slicing.
Here's the live example: https://colab.research.google.com/drive/13HKWVLZKii_1XTGk1DikhizbJbh7Mggj
And the relevant code:
```python
import kachery as ka
import h5_to_json as h5j
from matplotlib import pyplot as plt

# configure connection to the remote kachery server
ka.set_config(
    url='http://132.249.245.245:24342',
    channel='public',
    password='public',
    download_only=True,
    verbose=True
)

# load the .nwb.json object (created via h5_to_json)
X = ka.load_object('sha1://5bfc5d6b06c1b927d63ddde446a9c3079271f5f5/bon03.nwb.json')

# create the hierarchical version for convenience
Xh = h5j.hierarchy(X)

# make a pointer to the large LFP array
lfp_lazy = h5j.get_value(Xh['root']['acquisition']['LFP']['bonlfp-3']['_datasets']['data'], use_kachery=True, lazy=True)

# retrieve a small amount of data from a few channels and plot it
data = lfp_lazy[0:5, 1000000:1000500]
plt.plot(data.T);
```
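For a sense of what such a lazy loader can look like (a toy sketch, not the actual h5_to_json implementation): an array-like class resolves a pair of slices into per-row seek/read calls on the underlying file-like object, assuming C-order (row-major) layout. The `LazyArray2D` name is invented, and an in-memory buffer stands in for the remote stream:

```python
import io
import numpy as np

class LazyArray2D:
    """Sketch of a lazy 2D array over a seekable file-like object.
    Assumes C-order layout; only handles a [rows, cols] pair of slices."""
    def __init__(self, f, shape, dtype):
        self._f = f
        self.shape = shape
        self.dtype = np.dtype(dtype)

    def __getitem__(self, index):
        rows, cols = index
        r0, r1, _ = rows.indices(self.shape[0])
        c0, c1, _ = cols.indices(self.shape[1])
        itemsize = self.dtype.itemsize
        out = np.empty((r1 - r0, c1 - c0), dtype=self.dtype)
        for i, r in enumerate(range(r0, r1)):
            # seek to the start of the requested columns within row r
            self._f.seek((r * self.shape[1] + c0) * itemsize)
            raw = self._f.read((c1 - c0) * itemsize)
            out[i] = np.frombuffer(raw, dtype=self.dtype)
        return out

# usage, with an in-memory buffer standing in for the remote file-like object
a = np.arange(20, dtype=np.int64).reshape(4, 5)
lazy = LazyArray2D(io.BytesIO(a.tobytes()), shape=(4, 5), dtype=np.int64)
assert np.array_equal(lazy[1:3, 2:5], a[1:3, 2:5])
```

Only the requested bytes are ever read, which is what makes remote slicing cheap relative to downloading the whole dataset.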
See https://github.com/flatironinstitute/h5_to_json where I put a preliminary README and am adding installation and usage instructions now. You can take a look at the bullet points there.