hdmf-dev / hdmf-zarr

Zarr I/O backend for HDMF
https://hdmf-zarr.readthedocs.io/

kerchunk considerations #174

Closed: magland closed this issue 3 months ago

magland commented 4 months ago

@oruebel @bendichter @rly

Kerchunk has its own way of converting hdf5 to zarr, and I'm wondering if it would make sense to conform to that convention as much as possible. Here's an example of creating a .zarr.json file from a remote .nwb file.

import json
import h5py
import remfile
import kerchunk.hdf

# 000713 - Allen Institute - Visual Behavior - Neuropixels
# https://neurosift.app/?p=/nwb&url=https://api.dandiarchive.org/api/assets/b2391922-c9a6-43f9-8b92-043be4015e56/download/&dandisetId=000713&dandisetVersion=draft
url = "https://api.dandiarchive.org/api/assets/b2391922-c9a6-43f9-8b92-043be4015e56/download/"

# Translate remote hdf5 to local .zarr.json
remf = remfile.File(url, verbose=False)
with h5py.File(remf, 'r') as f:
    h5chunks = kerchunk.hdf.SingleHdf5ToZarr(f, url=url)
    a = h5chunks.translate()
    with open('example1.zarr.json', 'w') as g:
        json.dump(a, g, indent=2)

I am attaching the output file for reference.

example1.zarr.json

As expected, there were some things that couldn't be translated (references and scalar datasets). I did a monkey patch of kerchunk to handle those cases in a way compatible with hdmf-zarr. (The above attached file is actually the result of the monkey patches)

I plan on using kerchunk for various situations. One example I have in mind is to create an augmented NWB file containing content from a dandiset plus additional nwb objects. This is important for building on existing files without needing to download or duplicate the existing content. But there is a problem if kerchunk and hdmf-zarr use different conventions for doing the hdf5->zarr conversion. That's why I'm wondering whether we should conform to using their conventions as much as possible.

... Or maybe it's not important for them to match so long as they are compatible. In any event, I think it's important to document the conventions for doing the conversion (what happens to references, scalar datasets, etc).

bendichter commented 4 months ago

What are the key differences between the way hdmf-zarr is doing the conversion vs. the way kerchunk is doing it?

magland commented 4 months ago

What are the key differences between the way hdmf-zarr is doing the conversion vs. the way kerchunk is doing it?

Here's the first thing I looked at: acquisition/v_in/data/.zattrs

kerchunk version:

{\"_ARRAY_DIMENSIONS\":[\"phony_dim_0\"],\"conversion\":1.0,\"offset\":0.0,\"resolution\":-1.0,\"unit\":\"V\"}

hdmf-zarr version:

{
    "conversion": 1.0,
    "offset": 0.0,
    "resolution": -1.0,
    "unit": "V",
    "zarr_dtype": "float64"
}

So in this case, the difference is that hdmf-zarr adds zarr_dtype whereas kerchunk adds _ARRAY_DIMENSIONS

Regarding "phony_dim_0", I found this in the kerchunk source code

https://github.com/fsspec/kerchunk/blob/6fe1f0aa6d33d856ca416bc13a290e2276d3bdb1/kerchunk/hdf.py#L544-L549

I'll keep looking for other differences.

magland commented 4 months ago

Another difference. In the acquisition/v_in/data/.zarray hdmf-zarr uses 2 chunks whereas kerchunk uses 1 chunk -- the original hdf5 file uses 1 chunk, so I'm not sure why hdmf-zarr is splitting into two (the shape is [53971]). I guess this difference doesn't matter too much... but I am curious about how hdmf-zarr decides about chunking.

Here's the next thing I looked at: In file_create_date/.zattrs:

kerchunk version

{\"_ARRAY_DIMENSIONS\":[\"phony_dim_0\"]}

hdmf-zarr version

{
    "zarr_dtype": "object_"
}
magland commented 4 months ago

Another small difference: For session start time, kerchunk doesn't have any attributes, but hdmf-zarr has zarr_dtype: scalar

Maybe these differences aren't too big. Here are the questions I have.

oruebel commented 4 months ago

but hdmf-zarr has zarr_dtype: scalar

We have a couple of reserved attributes that hdmf-zarr adds; see https://hdmf-zarr.readthedocs.io/en/latest/storage.html#reserved-attributes. The main reason these exist is that Zarr does not support object references and links, so we had to implement support for links and references in hdmf-zarr ourselves. See the storage model for links and object references for details on how this is implemented. I believe we set the zarr_dtype attribute in all cases (although I think it is strictly only necessary to identify links and references).
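For illustration, a link created by hdmf-zarr ends up as an entry in the group's zarr_link attribute, roughly along these lines (field names paraphrased from the storage docs linked above; the link name and path are just examples):

"zarr_link": [
    {
        "name": "timestamps",
        "source": ".",
        "path": "/acquisition/v_in/timestamps"
    }
]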

Another difference. In the acquisition/v_in/data/.zarray hdmf-zarr uses 2 chunks whereas kerchunk uses 1 chunk

Kerchunk (as far as I know) only indexes the HDF5 file, i.e., it does not convert the data. As such, kerchunk is (I believe) just representing how the data is laid out in the HDF5 file. In contrast, when converting data with hdmf-zarr we are creating a new Zarr file, so the layout of the data can change. The PR https://github.com/hdmf-dev/hdmf-zarr/pull/153 adds functionality to preserve the chunking and compression settings used in an HDF5 file as much as possible. When chunking is not specified, I believe the Zarr library makes its own best guess about how to break the data into chunks. My guess is that the change from 1 chunk to 2 chunks is due to the chunking not being specified during conversion and Zarr making its own guess about how to store it. I think https://github.com/hdmf-dev/hdmf-zarr/pull/153 should fix this for conversion from HDF5 to Zarr.
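Outside of hdmf-zarr, the basic idea is just to carry the source dataset's chunk shape (and, where possible, its compression) over to the new Zarr array instead of letting Zarr guess. A minimal sketch of that idea (not the actual #153 implementation; file paths are hypothetical):

import h5py
import zarr
from numcodecs import GZip

# Copy one HDF5 dataset into a Zarr array, reusing the source chunk shape
# when one is defined and falling back to Zarr's auto-chunking otherwise.
with h5py.File("example.h5", "r") as f:
    dset = f["acquisition/v_in/data"]
    chunks = dset.chunks if dset.chunks is not None else True  # True -> let Zarr guess
    compressor = GZip(level=4) if dset.compression == "gzip" else None
    z = zarr.open_array(
        "v_in_data.zarr",
        mode="w",
        shape=dset.shape,
        dtype=dset.dtype,
        chunks=chunks,
        compressor=compressor,
    )
    z[...] = dset[...]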

magland commented 4 months ago

Thanks @oruebel . Somehow I didn't come across those docs. Very helpful.

I think I understand the requirements now, so this issue can be closed.

oruebel commented 4 months ago

I did a monkey patch of kerchunk to handle those cases in a way compatible with hdmf-zarr.

@magland Very interesting. We were also planning to look at kerchunk as an option to use through hdmf-zarr to facilitate remote access to HDF5 files. Would be great to sync up.

oruebel commented 4 months ago

As expected, there were some things that couldn't be translated (references and scalar datasets). I did a monkey patch of kerchunk to handle those cases in a way compatible with hdmf-zarr.

Thanks for sharing the JSON. Could you point me to where a link and object reference is being translated? I didn't see the zarr_dtype anywhere to tell hdmf_zarr to handle the links.

oruebel commented 4 months ago

@rly @sinha-r the discussion here is relevant to the nwb cloud benchmark efforts

oruebel commented 4 months ago

I think I understand the requirements now, so this issue can be closed.

@magland Ok, will close for now. I think it would still be useful to have a chat with @rly and @sinha-r to make sure what we plan to do with kerchunk lines up with your needs.

magland commented 4 months ago

As expected, there were some things that couldn't be translated (references and scalar datasets). I did a monkey patch of kerchunk to handle those cases in a way compatible with hdmf-zarr.

Thanks for sharing the JSON. Could you point me to where a link and object reference is being translated? I didn't see the zarr_dtype anywhere to tell hdmf_zarr to handle the links.

Oops I accidentally shared the version of the file before I did the patches. Here's the one which includes the references.

example1.zarr.json

Search for .specloc and \"target\"

magland commented 4 months ago

I think I understand the requirements now, so this issue can be closed.

@magland Ok, will close for now. I think it would still be useful to have a chat with @rly and @sinha-r to make sure what we plan to do with kerchunk lines up with your needs.

Happy to chat

oruebel commented 4 months ago

Oops I accidentally shared the version of the file before I did the patches. Here's the one which includes the references.

Thanks for sharing this updated version. On read, I think the attributes that hold links here will probably not be resolved correctly by hdmf-zarr, because the zarr_dtype is missing. The relevant code for reading object references stored in attributes is here: https://github.com/hdmf-dev/hdmf-zarr/blob/556ed123e5cc07a434fd13593c06dca861267b8f/src/hdmf_zarr/backend.py#L1486-L1493, and an example of how the JSON should be formatted for such an attribute is here: https://hdmf-zarr.readthedocs.io/en/latest/storage.html#storing-object-references-in-attributes
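Roughly, the documented format for an attribute holding an object reference looks like this (paraphrased from the linked docs; the attribute name and target path are placeholders):

"some_reference_attribute": {
    "zarr_dtype": "object",
    "value": {
        "source": ".",
        "path": "/general/devices/some_device"
    }
}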

oruebel commented 4 months ago

Happy to chat

Great! We'll reach out to schedule a chat.

magland commented 4 months ago

I made a fork of kerchunk to add a parameter I am calling hdmf_mode

https://github.com/fsspec/kerchunk/compare/main...magland:kerchunk-fork:hdmf

(note this is the hdmf branch of my fork)

It can be tested via

import json
import h5py
import remfile
import kerchunk.hdf

# 000713 - Allen Institute - Visual Behavior - Neuropixels
# https://neurosift.app/?p=/nwb&url=https://api.dandiarchive.org/api/assets/b2391922-c9a6-43f9-8b92-043be4015e56/download/&dandisetId=000713&dandisetVersion=draft
url = "https://api.dandiarchive.org/api/assets/b2391922-c9a6-43f9-8b92-043be4015e56/download/"

# Translate remote hdf5 to local .zarr.json
remf = remfile.File(url, verbose=False)
with h5py.File(remf, 'r') as f:
    grp = f
    h5chunks = kerchunk.hdf.SingleHdf5ToZarr(grp, url=url, hdmf_mode=True)
    a = h5chunks.translate()
    with open('example1.zarr.json', 'w') as g:
        json.dump(a, g, indent=2)

This gives one warning for a dataset with embedded references. See note in source code.

rly commented 4 months ago

Great! Thank you @magland . I will take a look. I hope that some or all of these changes can be integrated back into kerchunk at some point.

rly commented 4 months ago

Note that this issue may be relevant for handling scalar string datasets: https://github.com/fsspec/kerchunk/issues/387 . The proposed fix in numcodecs has not yet been released.

magland commented 4 months ago

Some notes in advance of our chat.

My forked kerchunk now has a number of customizations. I am not sure what the best strategy is for merging back, but for now I am making adjustments as needed.

Neurosift now supports .zarr.json files generated by (forked) kerchunk. I have a GitHub Action that iterates through all the NWB assets on DANDI and prepares kerchunk files. The expectation, of course, is that these will all need to be replaced as we work out issues, but I wanted to see how time-consuming the processing would be. I have parallelization in place, etc.

Here's an example that loads really quickly in neurosift: https://neurosift.app/?p=/nwb&dandisetId=000409&dandisetVersion=draft&url=https://api.dandiarchive.org/api/assets/54b277ce-2da7-4730-b86b-cfc8dbf9c6fd/download/

It points to the DANDI hdf5 file, but neurosift internally checks the https://kerchunk.neurosift.org bucket for the corresponding .zarr.json file. If it finds one, it uses that. (You can also manually point to any .zarr.json file.)

Many of these nwb files have thousands, tens of thousands, or even hundreds of thousands of chunks. It becomes impractical to kerchunk such files when streaming the remote file -- it's typically 1 network request per chunk! There's also a disadvantage of having such a large .zarr.json file because that takes longer to load in neurosift (or other tools). So, for now I have introduced a parameter in kerchunk-fork called num_chunks_per_dataset_threshold, and I have set it to 1000. So, for datasets with >1000 chunks, it doesn't include those directly in the .json file, but instead makes a link to the original nwb file, and the path of the dataset. I also do not process nwb files with >1000 items (some have an excessive number that would cause the .json file to be very large, and again time-consuming to generate).
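For context on why the .zarr.json size tracks the chunk count: a kerchunk reference file is essentially one flat refs mapping with an entry per chunk, each pointing at a byte range in the source file. Schematically (keys, offsets, and lengths below are placeholders):

{
  "version": 1,
  "refs": {
    ".zgroup": "{\"zarr_format\": 2}",
    "some/dataset/.zarray": "{\"shape\": [...], \"chunks\": [...], ...}",
    "some/dataset/0.0": ["<source hdf5 url>", <byte offset>, <length>],
    "some/dataset/0.1": ["<source hdf5 url>", <byte offset>, <length>]
  }
}

So a dataset with tens of thousands of chunks contributes tens of thousands of entries like the last two.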

In my opinion, we should not be creating Zarr datasets with tens of thousands of individual files. With kerchunk it is possible to consolidate this a lot. What I would propose is to have up to 1000 files that are around 1GB in size, and then a single .zarr.json file that references locations within those files. Then you would have all the advantages of efficient zarr storage, but not the excessive number of files. Also, you wouldn't need to put everything in a single .zarr.json, you could spread the references among a hierarchy of those files.

My goal with all this is to be able to augment existing NWB files on DANDI by adding objects without downloading all the original data. Right now, you can create new/derivative NWB files that either duplicate a lot of data from the upstream file or are lacking in that data. Both cases are suboptimal. What the .zarr.json allows us to do is to include the data from the upstream file without downloading or duplicating it. So the flow would be... point to a .zarr.json file as the base nwb file, do some processing (streaming the data), produce new NWB items, add those into a new .zarr.json file that contains both old and new objects as references... and then share the new derivative as a json file. Ideally there would be some mechanism of submitting that to DANDI as well.

To start exploring this possibility, I tried to use fsspec/ReferenceFileSystem, zarr.open, and NWBZarrIO to load a kerchunk'd NWB file. After some adjustments to hdmf_zarr (to allow path to be a zarr.Group), it worked! At least for one example... for other examples I am getting various NWB errors. To give an idea on how this works:

# Note: this only works after some tweaks to hdmf_zarr to allow path to be a zarr.Group

from fsspec.implementations.reference import ReferenceFileSystem
import zarr
from hdmf_zarr.nwb import NWBZarrIO
import pynwb
import remfile
import h5py

# This one seems to load properly
# https://neurosift.app/?p=/nwb&dandisetId=000717&dandisetVersion=draft&url=https://api.dandiarchive.org/api/assets/3d12a902-139a-4c1a-8fd0-0a7faf2fb223/download/
h5_url = 'https://api.dandiarchive.org/api/assets/3d12a902-139a-4c1a-8fd0-0a7faf2fb223/download/'
json_url = 'https://kerchunk.neurosift.org/dandi/dandisets/000717/assets/3d12a902-139a-4c1a-8fd0-0a7faf2fb223/zarr.json'

# This fails with error: Could not construct ImageSeries object due to: ImageSeries.__init__: incorrect type for 'starting_frame' (got 'int', expected 'Iterable')
# https://neurosift.app/?p=/nwb&dandisetId=000409&dandisetVersion=draft&url=https://api.dandiarchive.org/api/assets/54b277ce-2da7-4730-b86b-cfc8dbf9c6fd/download/
# h5_url = 'https://api.dandiarchive.org/api/assets/54b277ce-2da7-4730-b86b-cfc8dbf9c6fd/download/'
# json_url = 'https://kerchunk.neurosift.org/dandi/dandisets/000409/assets/54b277ce-2da7-4730-b86b-cfc8dbf9c6fd/zarr.json'

def load_with_kerchunk():
    fs = ReferenceFileSystem(json_url)
    store = fs.get_mapper(root='/', check=False)
    root = zarr.open(store, mode='r')

    # Load NWB file
    with NWBZarrIO(path=root, mode="r", load_namespaces=True) as io:
        nwbfile = io.read()
        print(nwbfile)
        print('********************************')
        print(nwbfile.acquisition)

def load_with_h5_streaming():
    remf = remfile.File(h5_url)
    h5f = h5py.File(remf, mode='r')
    with pynwb.NWBHDF5IO(file=h5f, mode='r', load_namespaces=True) as io:
        nwbfile = io.read()
        print(nwbfile)
        print('********************************')
        print(nwbfile.acquisition)

if __name__ == "__main__":
    load_with_kerchunk()
    # load_with_h5_streaming()

Even with the data hosted remotely, this loads in a fraction of a second, because all the metadata comes down in one shot.

I thought a bit about how one would write new objects to the loaded file, and I think that is going to require a custom zarr storage backend.
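To make that concrete, one possible shape for such a backend (purely a sketch, nothing implemented yet) is a store that answers reads from the kerchunk ReferenceFileSystem mapper and diverts all writes to a separate overlay mapping:

from collections.abc import MutableMapping

class LayeredStore(MutableMapping):
    # Sketch of a zarr (v2) store: reads fall through to a read-only base
    # mapping (e.g. fs.get_mapper(...) over a .zarr.json), while writes go to
    # an in-memory overlay so the referenced HDF5 bytes are never modified.
    def __init__(self, base_store, overlay=None):
        self._base = base_store
        self._overlay = overlay if overlay is not None else {}

    def __getitem__(self, key):
        if key in self._overlay:
            return self._overlay[key]
        return self._base[key]

    def __setitem__(self, key, value):
        self._overlay[key] = value

    def __delitem__(self, key):
        del self._overlay[key]

    def __iter__(self):
        yield from self._overlay
        for key in self._base:
            if key not in self._overlay:
                yield key

    def __len__(self):
        return sum(1 for _ in self)

Since zarr (v2) accepts any MutableMapping as a store, something like zarr.open(LayeredStore(store), mode='r+') would in principle allow new objects to be added on top of the referenced data; persisting the overlay back out as a derived .zarr.json is the part that would still need to be worked out.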

Well that's a lot of info all at once... just thought I'd put it out there in advance of the meeting.

bendichter commented 4 months ago

Amazing! I can't wait to play around with this 😁

magland commented 4 months ago

@bendichter In our meeting we came up with a game plan for how to move forward, and we'll provide updates once we have made some progress on that.

oruebel commented 4 months ago

Thanks for the great discussions! I'm closing this issue for now just for housekeeping. Feel free to reopen if necessary.

bendichter commented 4 months ago

@magland sounds great. Would you mind sharing a summary of the plan?

magland commented 4 months ago

@bendichter we're working on something called Linked Data Interface (LINDI). Feel free to comment or contribute over there.