hdmf-dev / hdmf-zarr

Zarr I/O backend for HDMF
https://hdmf-zarr.readthedocs.io/
Other
7 stars 7 forks source link

[Bug]: ZarrIO cannot resolve Builders #113

Closed mavaylon1 closed 10 months ago

mavaylon1 commented 1 year ago

What happened?

Within HDMF (using HDF5) we wrap a dataset of references. When using NWBHDF5IO (which inherits from HDMFIO), when we read a file, the read method calls read_builder. read_builder takes the wrapped dataset of references and returns builders for each reference. These builders are resolved to containers; however, these containers are not "mapped/pointed" to the object/dataset/attribute within the file until a user makes a call, i.e., if a user calls nwbfile.electrodes[0] (then it will make the connection from "0" index electrode to the correct container).

Within HDMF-Zarr, using ZarrIO, a different approach was taken. On read, we have a dataset of references (unwrapped). ZarrIO takes these references and makes a dataset of builders. These builders are not resolved to containers (not intended).

Steps to reproduce:

cd docs/gallery
python plot_convert_nwb_hdf5.py

with NWBZarrIO(path='test_zarr_sub_anm00239123_ses_20170627T093549_ecephys_and_ogen.nwb.zarr', mode="r") as io:
    infile = io.read()

infile.electrodes.group.data

This will return an array of builders instead of the actual data.

(Notes on connected issue #105) This was discovered when investigating a fix for #105 in which

TypeError: Can't implicitly convert non-string objects to strings

when converting from zarr to hdf5. This is due to the cache being cleared within export(). Why?

  1. (cache cleared): When the cache is cleared, we lose the fact that the dtype is references. As a result, the dtype takes the value of 'object' since we have an array of builders. Within objectmapper we have

    elif np.issubdtype(value.dtype, np.dtype('O')):
                    # Only variable-length strings should ever appear as generic objects.
                    # Everything else should have a well-defined type
                    ret_dtype = 'utf8'

    We are trying to match builders, which is an object, to a string ==> TypeError

  2. (not clearing cache): The dtype is preserved as a reference.

Either way (1 or 2), we get a zarr file that does not resolve the builders to containers. The idea is once that is fixed, the issue from #105 (resulting from the cache) should be fixed (possibly).

Steps to Reproduce

Above.

Traceback

Above.

Operating System

Windows

Python Executable

Conda

Python Version

3.7

Package Versions

No response

Code of Conduct

oruebel commented 1 year ago

In ZarrIO.read_builder we can't resolve object references to Containers because the Containers are constructed later by the BuildManager in HDMFIO.read. To resolve the builders we could:

1) The simplest solution may be to: 1) record all datasets that need to be resolved to Containers during the read_builder phase, and 2) overwrite ZarrIO.read to resolve the builders to containers to create arrays of Containers instead of builders. Advantage: Should be simple to implement. Disadvantage: Requires reading all datasets with object references.

2) Similar to HDF5IO we could wrap the datasets with object references to lazily resolve them to Containers whenever the user reads the data. Advantages: Avoids loading data on read, Disadvantages: This will require more code to implement and will likely require also some refactoring in HDMF to make dynamic resolution of references easier to reuse across backends.

mavaylon1 commented 1 year ago

Recap:

HDF5 (How HDMF works) When reading in a NWBFile, read_builder will take a dataset of references and wrap it (how so?). Only when a user calls a specific container will that container be taken from the wrapped dataset of references where that reference is resolved to the container.

I commented through read (which calls read_builder). At the end of the day, io.read() will return the file (in container form from the file builder). The question that still remains is whether the containers within have been "resolved to containers" or do the remain in builder form until called?

def read(self, **kwargs):
        """Read a container from the IO source."""
        f_builder = self.read_builder() # the builder (dictionary) for the file
        if all(len(v) == 0 for v in f_builder.values()):
            # TODO also check that the keys are appropriate. print a better error message
            raise UnsupportedOperation('Cannot build data. There are no values.')
        container = self.__manager.construct(f_builder) # constructs the the container from the builder (the file)
        container.read_io = self # sets the io object for the container the io instance 
        # skip HERD stuff for clarity
        return container # returns the file

def read_builder(self):
        """
        Read data and return the GroupBuilder representing it.

        NOTE: On read, the Builder.source may will usually not be set of the Builders.
        NOTE: The Builder.location is used internally to ensure correct handling of links (in particular on export)
        and should be set on read for all GroupBuilder, DatasetBuilder, and LinkBuilder objects.
        """
        if not self.__file:
            raise UnsupportedOperation("Cannot read data from closed HDF5 file '%s'" % self.source)
        f_builder = self.__read.get(self.__file) # This gets the builder for the file
        # ignore cached specs when reading builder
        ignore = set()
        specloc = self.__file.attrs.get(SPEC_LOC_ATTR)
        if specloc is not None:
            ignore.add(self.__file[specloc].name)
        if f_builder is None:
            f_builder = self.__read_group(self.__file, ROOT_NAME, ignore=ignore)
            self.__read[self.__file] = f_builder
        return f_builder

Questions:

  1. Wrapped in what way?
  2. When we say resolved, does that mean it goes from reference to builder to container?

HDMF_Zarr In ZarrIO, the read_builder will take a dataset of references and construct builders for those references (stored in an array). These builders are not resolved to containers. Run the following to see it in action.

cd docs/gallery
python plot_convert_nwb_hdf5.py

with NWBZarrIO(path='test_zarr_sub_anm00239123_ses_20170627T093549_ecephys_and_ogen.nwb.zarr', mode="r") as io:
    infile = io.read()

infile.electrodes.group.data

This is me thinking out loud and will continue to dig.

mavaylon1 commented 1 year ago
def construct(self, **kwargs):
        """ Construct the AbstractContainer represented by the given builder """
        builder, build_manager, parent = getargs('builder', 'build_manager', 'parent', kwargs)
        if build_manager is None:
            build_manager = BuildManager(self)
        obj_mapper = self.get_map(builder)
        if obj_mapper is None:
            dt = builder.attributes[self.namespace_catalog.group_spec_cls.type_key()]
            raise ValueError('No ObjectMapper found for builder of type %s' % dt)
        else:
            return obj_mapper.construct(builder, build_manager, parent)

Within construct for BuildManager it calls to get the ObjectMapper. The ObjectMapper's role is to use the spec to map the fields/attrs between the containers and the builders.

Once you have the mapper, we call construct (method in obj_mappper as well). It seems that the construct from obj_mapper return the actual new container.

mavaylon1 commented 1 year ago

Screenshot 2023-08-23 at 8 24 15 PM It seems we recursively construct, which would mean we are recursively building containers for the sub_builders, which means (to me) we aren't wrapping.

oruebel commented 1 year ago

Thanks for investigating @mavaylon1

For datasets, loading of object_references is done here:

https://github.com/hdmf-dev/hdmf-zarr/blob/6c13e14927eea985d53174d8580224c97d65707a/src/hdmf_zarr/backend.py#L1160C16-L1162C79

Which in turn calls:

https://github.com/hdmf-dev/hdmf-zarr/blob/6c13e14927eea985d53174d8580224c97d65707a/src/hdmf_zarr/backend.py#L1175

to resolve the references. At the end of that function the references are being resolved by calling __read_dateset or __read_group, respectively (depending on the which type of object is being referenced).

https://github.com/hdmf-dev/hdmf-zarr/blob/6c13e14927eea985d53174d8580224c97d65707a/src/hdmf_zarr/backend.py#L1206-L1209

When resolving references, the corresponding builder will often have already been read from file, in which case __read_dateset and __read_group will just return the previously read builder (or otherwise make the builder):

https://github.com/hdmf-dev/hdmf-zarr/blob/6c13e14927eea985d53174d8580224c97d65707a/src/hdmf_zarr/backend.py#L1106-L1109

https://github.com/hdmf-dev/hdmf-zarr/blob/6c13e14927eea985d53174d8580224c97d65707a/src/hdmf_zarr/backend.py#L1052-L1055

Either way, this means we now have a dataset of Builders. After ZarrIO.read_builder is complete, HMDFIO.read will then construct Containers for Builders:

https://github.com/hdmf-dev/hdmf/blob/1ca4457de67d979959923cacc7b5ff5f276cca9b/src/hdmf/backends/io.py#L54-L62

I.e., the Containers are being constructed for the Builders since they are being used elsewhere. However, the Dataset of object_references are not being resolved, i.e., we still have a dataset of Builders instead of a dataset of Containers for datasets with references. As I mentioned above:

  1. The simplest solution may be to: 1) record all datasets that need to be resolved to Containers during the read_builder phase, and 2) overwrite ZarrIO.read to resolve the builders to containers to create arrays of Containers instead of builders.

However, while this may be the simplest approach, the disadvantage is that this still requires reading all datasets with object references. To avoid reading the data, it would be useful to:

2. Similar to HDF5IO we could wrap the datasets with object references to lazily resolve them to Containers whenever the user reads the data. Advantages: Avoids loading data on read, Disadvantages: This will require more code to implement and will likely require also some refactoring in HDMF to make dynamic resolution of references easier to reuse across backends.

However, this will be more involved to implement and it may be easiest to do 1. first and then implement 2.

mavaylon1 commented 1 year ago

High Level ZarrIO vs HDF5IO

  1. HDF5IO
    • Read calls read_builder
    • read_builder will create a builder for the file containing builders of all the groups and datasets (recursively). When this is being created, if it comes across a dataset of references, it will be wrapped with BuilderH5ReferenceDataset. You can check this by looking at the finished builder, accessing the the dataset of references and see that it is indeed wrapped.
    • Once read_builder is done and returns the file builder (containing builders), the "regular" objects are resolved to containers with read. However, the wrapped objects are resolved "lazily" only when the user accesses them. For example, in ElectricalSeries, electrodes is a reference that will be resolved when we access it.
  2. ZarrIO ZarrIO does not do this, hence the issue of dataset of references not being resolved from Builders.
    • read calls read_builder
    • read_builder will have an array of builders for references and does not wrap them. This still returns a builder containing the builders of the objects in the file.
    • read will return containers but does not resolve the dataset of builders for references.

The idea is to implement what HDF5IO has into ZarrIO. Start with looking at h5tools read_dataset