hdmf-dev / hdmf

The Hierarchical Data Modeling Framework
http://hdmf.readthedocs.io

Append a Dataset of References #1135

Closed mavaylon1 closed 1 month ago

mavaylon1 commented 3 months ago

Motivation

What was the reasoning behind this change? Please explain the changes briefly.

How to test the behavior?

Show how to reproduce the new behavior (can be a bug fix or a new feature)

Checklist

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.91%. Comparing base (acc3d78) to head (d5ad0e4). Report is 1 commit behind head on dev.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##              dev    #1135      +/-   ##
==========================================
+ Coverage   88.89%   88.91%   +0.01%
==========================================
  Files          45       45
  Lines        9844     9857      +13
  Branches     2799     2802       +3
==========================================
+ Hits         8751     8764      +13
  Misses        776      776
  Partials      317      317
```


rly commented 2 months ago

Minor suggestion to a test. Looks good otherwise.

rly commented 2 months ago

I added a test that raises an unexpected error:

self = <Closed HDF5 file>, name = <HDF5 object reference (null)>

    @with_phil
    def __getitem__(self, name):
        """ Open an object in the file """

        if isinstance(name, h5r.Reference):
            oid = h5r.dereference(name, self.id)
            if oid is None:
>               raise ValueError("Invalid HDF5 object reference")
E               ValueError: Invalid HDF5 object reference
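For context, this is h5py refusing to dereference a null (or stale) object reference. A minimal standalone sketch, using plain h5py rather than the hdmf test itself, reproduces the same ValueError:

```python
# Standalone h5py sketch (not the hdmf test): elements of a reference-typed
# dataset default to null references, and dereferencing one raises the
# ValueError shown in the traceback above.
import h5py

with h5py.File("null_ref.h5", "w") as f:
    ds = f.create_dataset("baz_data", shape=(1,), dtype=h5py.ref_dtype)
    null_ref = ds[0]  # an uninitialized (null) object reference
    try:
        f[null_ref]
    except ValueError as err:
        print(err)  # Invalid HDF5 object reference
```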

We just chatted in person, but noting here for the record: you were going to take a look at this.

mavaylon1 commented 2 months ago

@rly There may be a workaround, but I think the problem described below might be reason enough to just start on the proxy idea.

def append(self, arg):
    # Walk up the parent chain to find the root container
    child = arg
    while True:
        if child.parent is not None:
            parent = child.parent
            child = parent
        else:
            parent = child
            break

    # Build the root first so the builder tree carries full paths,
    # then build the object being appended
    self.io.manager.build(parent)
    builder = self.io.manager.build(arg)

    # Create a reference to the built object and append it to the dataset
    ref = self.io._create_ref(builder)
    append_data(self.dataset, ref)

When a user calls append on a reference, we build the root builder first. We then call _create_ref, which tries to create a reference via `return self.__file[path].ref`. This fails with `KeyError: "Unable to open object 'new'"` because it is trying to create a reference to an object (the new baz) within the file; however, that object is not in the file yet. It exists only in the root builder.
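The KeyError itself is easy to reproduce at the h5py level. A minimal sketch (file and path names are illustrative):

```python
# Minimal h5py sketch: taking .ref of a path that was never written to the
# file raises the KeyError quoted above.
import h5py

with h5py.File("bucket.h5", "a") as f:
    f.require_group("/bazs")      # the parent group exists in the file
    try:
        ref = f["/bazs/new"].ref  # "new" exists only in the builder tree
    except KeyError as err:
        print(err)                # Unable to open object (object 'new' doesn't exist)
```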

Why isn't it in the file? I am in append mode, right? Let's ignore the reference for now and just add the new baz. It works (sort of): when you read it back, the new baz is not there. We need to call write again; once you do that, it is there. That means when we try to create a reference, it looks for the new baz in the file and finds nothing, because the baz is not added until write (which we never call during append).
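To make that sequencing concrete, here is a hedged pseudocode sketch of the append-mode flow described above (the Baz/BazBucket names loosely follow the hdmf test fixtures; the exact methods are assumptions):

```python
# Illustrative pseudocode only; container names and methods are assumptions
# loosely based on the hdmf Baz test fixtures.
with HDF5IO("bucket.h5", manager=manager, mode="a") as io:
    bucket = io.read()
    new_baz = Baz(name="new")
    bucket.add_baz(new_baz)          # the new group exists only in memory
    bucket.baz_data.append(new_baz)  # _create_ref looks for /bazs/new in the
                                     # file and finds nothing: groups are only
                                     # written out when io.write() is called
```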

Earlier I said in conversation that you do not need to call write. That is half true. In my prior method (seen below), you do not need to call write to append to a dataset of references, but you do need to call write to add a new baz, because the baz itself is a new group.

Earlier I had

def append(self, arg):
    # Get Builder
    builder = self.io.manager.build(arg)

    # Get HDF5 Reference
    ref = self.io._create_ref(builder)
    append_data(self.dataset, ref)

This leads to a reference being created, but it is not found by the test `self.assertIs(read_bucket1.baz_data.data[10], read_bucket1.bazs["new"])`. This is because the reference path is just `/`, which is wrong; it needs to be `/bazs/new`. Building only `arg` produces a builder with no parents attached, so its path resolves to the root.
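For contrast, a correct reference round-trips to the full path. A small standalone h5py check (file and group names are illustrative):

```python
# Sketch of the expected behavior: once the object actually exists in the
# file, a reference to it dereferences to its full path, not to the root.
import h5py

with h5py.File("bucket.h5", "a") as f:
    new = f.require_group("/bazs/new")  # object is now in the file
    ref = new.ref
    assert f[ref].name == "/bazs/new"   # resolves to '/bazs/new', not '/'
```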

Note: yes, this is the same code as in hdmf-zarr. I started to wonder whether this could just live in hdmf: append calls _create_ref, so all we would need is a unique _create_ref method per backend. In other words, we wouldn't need a zarr PR with this logic, probably just some name changes.

rly commented 2 months ago

I see. Tricky indeed. You can't create an HDF5 reference to an object that isn't in the file yet, and rebuilding the whole hierarchy on each append is not ideal. A proxy makes sense. I can't think of another workaround without severely limiting and documenting the ways in which you cannot append.

> Note: yes, this is the same code as in hdmf-zarr. I started to wonder whether this could just live in hdmf: append calls _create_ref, so all we would need is a unique _create_ref method per backend. In other words, we wouldn't need a zarr PR with this logic, probably just some name changes.

That sounds useful to look into. You may be able to refactor it and some fields into the base HDMFDataset class.
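If that refactor pans out, one rough sketch of the shape it could take (the class name and constructor here are hypothetical, not the actual hdmf API; `append_data` is the same helper used in the snippets above):

```python
# Hypothetical sketch of the refactor discussed above; names are assumptions.
# Only _create_ref differs per backend (HDF5IO, ZarrIO, ...), so append can
# live once in a shared base class.
class AppendableReferenceDataset:
    """Backend-agnostic append for a dataset of object references."""

    def __init__(self, dataset, io):
        self.dataset = dataset
        self.io = io

    def append(self, arg):
        # Build from the root so the builder tree carries the full path
        root = arg
        while root.parent is not None:
            root = root.parent
        self.io.manager.build(root)
        builder = self.io.manager.build(arg)
        # Backend-specific step: create a reference to the built object
        ref = self.io._create_ref(builder)
        append_data(self.dataset, ref)
```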

mavaylon1 commented 1 month ago

Add documentation here: https://github.com/NeurodataWithoutBorders/pynwb/issues/1951