hdmf-dev / hdmf-zarr

Zarr I/O backend for HDMF
https://hdmf-zarr.readthedocs.io/
Other
7 stars 7 forks source link

[Bug]: Converting HDF5 to Zarr with unlimted dimensions fails #154

Closed oruebel closed 7 months ago

oruebel commented 7 months ago

What happened?

When converting an HDF5 dataset to Zarr where the HDF5 dataset has an unlimited dimension, the data_shape will include None values for the unlimited dimensions when get_data_shape is being called"

https://github.com/hdmf-dev/hdmf-zarr/blob/71bfa6d7d66223227527463c3f829025fa7124fd/src/hdmf_zarr/backend.py#L1195-L1196

This is because hdmf.utils.get_data_shape looks at the maxshape of the data first rather than returning the actual shape https://github.com/hdmf-dev/hdmf/blob/8892ba6472d9751ff44b977a1d5a50e6512d3b05/src/hdmf/utils.py#L902-L903

However, the data shape when creating a new dataset in Zarr cannot contain None values, i.e., it must be a tuple of only integer values.

Steps to Reproduce

# Download sub-1214579789/sub-1214579789_ses-1214621812_icephys.nwb (asset URL 
# https://api.dandiarchive.org/api/assets/1142cd85-96f9-4ae2-9d71-f722887fb3c0/download/) 
# from  DANDI and convert with:

from pynwb import NWBHDF5IO
from hdmf_zarr.nwb import NWBZarrIO
import sys

def print_help():
    print("The script requires exactly one arguments. Call with")
    print("python nwb_hdf_to_zarr.py <hdf5_filename>")
    print("")
    print("The output zarr file will be named the same as the HDF5 file with the added .zarr extension")

def convert(hdf_filename: str, zarr_filename: str):
    with NWBHDF5IO(hdf_filename, 'r', load_namespaces=True) as read_io:  # Create HDF5 IO object for read
        with NWBZarrIO(zarr_filename, mode='w') as export_io:         # Create Zarr IO object for write
            export_io.export(src_io=read_io, write_args=dict(link_data=False))   # Export from HDF5 to Zarr

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print_help()
        exit(0)
    else:
        hdf_filename = sys.argv[1]
        zarr_filename = hdf_filename +".zarr"
        convert(hdf_filename=hdf_filename, zarr_filename=zarr_filename)

Traceback

See above

Operating System

Linux

Python Executable

Conda

Python Version

3.11

Package Versions

No response

Code of Conduct

oruebel commented 7 months ago

@rly I'll submit a PR with a fix for hdmf_zarr, however, I'm wondering whether a more approbriat fix would be to update the behavior of get_data_shape in hdmf to make sure we don't return shapes that include None values. However, I do not recall whether that is intended behavior of get_data_shape and why this may be needed. Alternatively, we could also add a flag to get_data_shape to allow use to configure whether None values are allowed in the returned shape or not. Thoughts?