hdmf-dev / hdmf-zarr

Zarr I/O backend for HDMF
https://hdmf-zarr.readthedocs.io/
Other
7 stars 7 forks source link

[Feature]: write `xarray`-compatible Zarr files #176

Open alejoe91 opened 4 months ago

alejoe91 commented 4 months ago

What would you like to see added to HDMF-ZARR?

Xarray supports the Zarr backend, but requires the _ARRAY_DIMENSIONS attribute to be set with a list of names for the array dimensions (e.g. [samples, channels]) - see https://docs.xarray.dev/en/stable/internals/zarr-encoding-spec.html#zarr-encoding

It would be great to add these attributes as default for known data types (e.g. ElectricalSeries)

@jsiegle

Is your feature request related to a problem?

NWB-Zarr files cannot be opened by xarray.open_zarr

What solution would you like?

Adding the _ARRAY_DIMENSIONS attributes to all "known" neurodata_types

Do you have any interest in helping implement the feature?

Yes.

Code of Conduct

bendichter commented 4 months ago
from xarray import DataArray
import numpy as np
from pynwb.testing.mock.ecephys import mock_ElectricalSeries
from h5py import Dataset
from pynwb import get_type_map
import json

dset_types = (np.ndarray, Dataset)  # etc.

def get_dimension_labels(cls, ndims, dataset_name):
    spec = get_type_map().namespace_catalog.get_spec(cls.namespace, cls.neurodata_type)
    data = next(x for x in spec["datasets"] if x["name"] == dset_name)
    dims = data["dims"]

    if isinstance(dims[0], str):  # only one shape spec
        return dims

    for i_dims in  dims:
        if len(i_dims) == ndims:
            return i_dims

def load_dset_as_xarray(obj, dset_name):
    dset = obj.fields[dset_name]
    cls = obj.__class__
    ndims = len(dset.shape)

    dim_labels = get_dimension_labels(cls, ndims, dset_name)

    coords = dict(num_channels=electrical_series.electrodes.data)
    if obj.timestamps is not None:
        coords.update(num_times=obj.timestamps)

    attrs = {k: v for k, v in obj.fields.items() if not isinstance(v, dset_types)}
    return DataArray(dset, dims=dim_labels, coords=coords, attrs=attrs)

electrical_series = mock_ElectricalSeries(timestamps=np.arange(10), rate=None)

load_dset_as_xarray(electrical_series, "data")
bendichter commented 4 months ago

^ Code that solves a related problem and might be helpful

oruebel commented 4 months ago

^ Code that solves a related problem and might be helpful

I think this could be useful to have in some form available in PyNWB. Maybe as a utility method and/or as a method on TimeSeries, since a common use-case for this is probably representing TimeSeries.data as xarray.

oruebel commented 4 months ago

It would be great to add these attributes as default for known data types (e.g. ElectricalSeries)

Adding the _ARRAY_DIMENSIONS attribute for cases where we know the dimensions seems like a good idea 👍

oruebel commented 4 months ago

In terms of implementation, I think this will require changes in HDMF as well. Here a rough plan of how this could be implemented:

  1. Add dimension_labels as an attribute on the DatasetBuilder(which may be None of the labels are unknown) https://github.com/hdmf-dev/hdmf/blob/5c8506216995f995b891da1e6b596ee42b7dd948/src/hdmf/build/builders.py#L321
  2. Enhance BuildManger.build to set the dimension_labels for DatasetBuildershttps://github.com/hdmf-dev/hdmf/blob/5c8506216995f995b891da1e6b596ee42b7dd948/src/hdmf/build/manager.py#L148
  3. Update ZarrIO.write_dataset to add the _ARRAY_DIMENSIONS to the attributes of the dataset if builder.dimension_labels are present.

https://github.com/hdmf-dev/hdmf-zarr/blob/6e946dabc2370bb428787fbe7d1ec919ba1fd11f/src/hdmf_zarr/backend.py#L954

@rly does that plan sound reasonable or what this also require changes in the ObjectMapper to determine the dimensions there instead of in the BuildManager?

rly commented 4 months ago

@rly does that plan sound reasonable or what this also require changes in the ObjectMapper to determine the dimensions there instead of in the BuildManager?

That sounds reasonable, except that all the building / creation of DatasetBuilder objects happen in the ObjectMapper. We'd probably do it in __add_attributes or the constructor.

oruebel commented 4 months ago

@mavaylon1 could you take a look at this?

rly commented 4 months ago

I'll work on the HDMF side

mavaylon1 commented 4 months ago

Can do