hdmf-dev / hdmf-zarr

Zarr I/O backend for HDMF
https://hdmf-zarr.readthedocs.io/

[Bug]: NWB Zarr to HDMF export fails #211

Open rcpeene opened 1 month ago

rcpeene commented 1 month ago

What happened?

Trying to export a Zarr NWB file to HDF5, but it raises an error.

Steps to Reproduce

Running the following snippet on an NWB file:

    from hdmf_zarr.nwb import NWBZarrIO  # Zarr backend for NWB
    from pynwb import NWBHDF5IO  # HDF5 backend

    with NWBZarrIO(str(zarr_filename), mode='r') as read_io:  # Create Zarr IO object for read
        with NWBHDF5IO(hdmf_filename, 'w') as export_io:  # Create HDF5 IO object for write
            export_io.export(src_io=read_io, write_args=dict(link_data=False))  # Export from Zarr to HDF5

I can't share the NWB file for licensing reasons.


### Traceback

```shell
/opt/conda/lib/python3.9/site-packages/hdmf/common/table.py:489: UserWarning: An attribute 'name' already exists on DynamicTable 'eye_tracking' so this column cannot be accessed as an attribute, e.g., table.name; it can only be accessed using other methods, e.g., table['name'].
  self.__set_table_attr(col)
Traceback (most recent call last):
  File "/root/capsule/./code/run_capsule.py", line 57, in <module>
    if __name__ == "__main__": run()
  File "/root/capsule/./code/run_capsule.py", line 47, in run
    export_io.export(src_io=read_io, write_args=dict(link_data=False))  # Export from Zarr to HDF5
  File "/opt/conda/lib/python3.9/site-packages/hdmf/utils.py", line 668, in func_call
    return func(args[0], **pargs)
  File "/opt/conda/lib/python3.9/site-packages/pynwb/__init__.py", line 399, in export
    super().export(**kwargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/utils.py", line 668, in func_call
    return func(args[0], **pargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 458, in export
    super().export(**ckwargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/utils.py", line 668, in func_call
    return func(args[0], **pargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/backends/io.py", line 166, in export
    self.write_builder(builder=bldr, **write_args)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/utils.py", line 668, in func_call
    return func(args[0], **pargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 836, in write_builder
    self.write_group(self.__file, gbldr, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/utils.py", line 668, in func_call
    return func(args[0], **pargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 1018, in write_group
    self.write_group(group, sub_builder, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/utils.py", line 668, in func_call
    return func(args[0], **pargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 1023, in write_group
    self.write_dataset(group, sub_builder, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/utils.py", line 668, in func_call
    return func(args[0], **pargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 1326, in write_dataset
    dset = self.__list_fill__(parent, name, data, options)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 1492, in __list_fill__
    raise e
  File "/opt/conda/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 1490, in __list_fill__
    dset[:] = data
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/opt/conda/lib/python3.9/site-packages/h5py/_hl/dataset.py", line 898, in __setitem__
    val = numpy.asarray(val, dtype=dtype.base, order='C')
  File "/opt/conda/lib/python3.9/site-packages/zarr/core.py", line 579, in __array__
    a = self[...]
  File "/opt/conda/lib/python3.9/site-packages/zarr/core.py", line 800, in __getitem__
    result = self.get_basic_selection(pure_selection, fields=fields)
  File "/opt/conda/lib/python3.9/site-packages/zarr/core.py", line 926, in get_basic_selection
    return self._get_basic_selection_nd(selection=selection, out=out, fields=fields)
  File "/opt/conda/lib/python3.9/site-packages/zarr/core.py", line 968, in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
  File "/opt/conda/lib/python3.9/site-packages/zarr/core.py", line 1343, in _get_selection
    self._chunk_getitems(
  File "/opt/conda/lib/python3.9/site-packages/zarr/core.py", line 2183, in _chunk_getitems
    self._process_chunk(
  File "/opt/conda/lib/python3.9/site-packages/zarr/core.py", line 2096, in _process_chunk
    chunk = self._decode_chunk(cdata)
  File "/opt/conda/lib/python3.9/site-packages/zarr/core.py", line 2366, in _decode_chunk
    chunk = chunk.view(self._dtype)
ValueError: When changing to a smaller dtype, its size must be a divisor of the size of original dtype
```

Operating System

Linux

Python Executable

Python

Python Version

3.9

Package Versions

No response

oruebel commented 1 month ago

Thanks for including the code and traceback. The issue appears to be due to some conversion between data types when exporting from Zarr to HDF5:

ValueError: When changing to a smaller dtype, its size must be a divisor of the size of original dtype

This error originates from here in the HDMF library when writing to disk:

File "/opt/conda/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 1490, in __list_fill__ dset[:] = data

> I can't share the NWB file for licensing reasons.

Since you can't share the original data file, we'll probably need your help to get to the root of this.

Option 1 would be, if you could share a "dummy" file that has the same issue, then we could investigate; i.e., we don't really need the real data to debug, but a file that looks similar and raises the same error should be fine.

Option 2 is to do a bit more retracing of steps on your end so we can at least figure out what case causes it and reproduce the issue on our end. A first step here would be to output all the properties of the dataset and data when the exception occurs, e.g., by adding a print statement before the exception is raised on line 1492, something along the lines of `print("parent", parent, "\n", "name", name, "\n", "dset", dset, "\n", "dset.dtype", dset.dtype, "\n", "data.dtype", data.dtype, "\n", "data", data)`, so that we can see what data types are being converted.
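For reference, the instrumentation would go roughly here inside `__list_fill__` in `hdmf/backends/hdf5/h5tools.py` (the surrounding `try`/`except` is paraphrased from the traceback; exact line numbers will vary across hdmf versions):

```python
# Paraphrased context from hdmf/backends/hdf5/h5tools.py, __list_fill__;
# only the print statement is the suggested addition.
try:
    dset[:] = data  # line ~1490, where the write into the HDF5 dataset fails
except Exception as e:
    # Temporary debug output: show what is being converted when the error hits
    print("parent", parent, "\n",
          "name", name, "\n",
          "dset", dset, "\n",
          "dset.dtype", dset.dtype, "\n",
          "data.dtype", data.dtype, "\n",
          "data", data)
    raise e  # line ~1492
```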

rcpeene commented 1 month ago

@oruebel I've received permission to share the file directly with you for examination as long as it isn't distributed. Would a OneDrive link work?

oruebel commented 1 month ago

> I've received permission to share the file directly with you for examination as long as it isn't distributed. Would a OneDrive link work?

Sure, a OneDrive link should be fine. Feel free to send it via Slack or email oruebel@lbl.gov so we can take a look. We'll treat the data confidentially and not share it with others.

rcpeene commented 1 month ago

Invite email sent.

rcpeene commented 4 weeks ago

Any updates here? It's one of the last things holding up our data pipeline.

oruebel commented 4 weeks ago

As far as I can tell, the issue seems to occur when copying /intervals/flash_block_presentations/tags. My guess is that this is likely due to the following:

I'll need to do a bit more digging to confirm. My guess is that the fix will likely need to be in HDMF. A possible workaround may be to wrap /intervals/flash_block_presentations/tags with H5DataIO before calling export to explicitly set the dtype, but I have not tested this yet.
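For context, this is roughly what wrapping data with H5DataIO looks like when building a container before write; it is not the untested tags workaround itself, and the TimeSeries here is only an illustration:

```python
import numpy as np
from hdmf.backends.hdf5 import H5DataIO
from pynwb import TimeSeries

# Generic illustration only: H5DataIO wraps the data backing a dataset so that
# HDF5-specific write options (e.g., compression, chunking) can be specified.
wrapped = H5DataIO(data=np.arange(10), compression='gzip')
ts = TimeSeries(name='example', data=wrapped, unit='n/a', rate=1.0)
```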

oruebel commented 4 weeks ago

What is confusing to me is that when printing from HDF5IO it shows `<zarr.core.Array '/intervals/flash_block_presentations/tags' (1011,) <U0 read-only>`, but when opening the file with Zarr manually it shows `<zarr.core.Array '/intervals/flash_block_presentations/tags' (1011,) object read-only>`. I'm not sure why the dtype would be `<U0` instead of `object`. It looks like, because of this, it is actually the read of the data from Zarr itself that is failing.
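For anyone retracing this, the manual check can be done by opening the store directly with Zarr (using the same dataset path as above; `zarr_filename` as in the original snippet):

```python
import zarr

# Open the NWB Zarr store directly, bypassing hdmf-zarr, to inspect the raw
# dtype of the tags dataset; zarr_filename is the path from the snippet above.
root = zarr.open(str(zarr_filename), mode='r')
tags = root['intervals/flash_block_presentations/tags']
print(tags)        # array repr, e.g. (1011,) object read-only
print(tags.dtype)  # object when opened directly with Zarr
```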

oruebel commented 4 weeks ago

It appears the issue is that the ObjectMapper in HDMF uses .astype('U') to enforce that the dtype of the dataset is unicode, as specified in the schema. For Zarr datasets this fails because Zarr does not support 'U' as a dtype for variable-length strings.
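To illustrate the mismatch: Zarr represents variable-length strings as object-dtype arrays with an explicit string codec, while NumPy's bare 'U' dtype is a zero-length fixed-width type (a minimal sketch, not code from hdmf itself):

```python
import numpy as np
import zarr
from numcodecs import VLenUTF8

# Zarr stores variable-length strings as object dtype plus a string codec;
# there is no variable-length 'U' dtype analogous to h5py's special string dtype.
z = zarr.array(["a", "bb", "ccc"], dtype=object, object_codec=VLenUTF8())
print(z.dtype)        # object
print(np.dtype('U'))  # <U0 -- a zero-length fixed-width unicode dtype
```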

I submitted a PR to HDMF for this: https://github.com/hdmf-dev/hdmf/pull/1171. With this change, I was able to convert the file to HDF5.