Write references in compound datasets at the end

mavaylon1 commented 8 months ago

Motivation

What was the reasoning behind this change? Please explain the changes briefly. Fix #144

Notes: The fix at face value is the following:

for j, item in enumerate(data):
        new_item = list(item)
        for i in refs:
             new_item[i] = self.__get_ref(item[i], export_source=export_source)
        new_items.append(tuple(new_item))

        generated_dtype = []
        for item in builder.dtype:
              if item['dtype'] == type('string'):
                  item['dtype'] = 'U25'
              generated_dtype.append((item['name'], item['dtype']))

 dset[...] = np.array(new_items, dtype=generated_dtype)

TLDR: We want to write outside of the the loop. What does this lead to?

To keep our compound dataset in its original form (i.e not expanded) we can use a structured array an pull the dtypes from the builder. However, it is required that the "data" in the array be in tuples and not lists (I believe it can also be in np.arrays but I have not tested this).

Having these as tuples makes things a bit hard when retrieving data on read. For example, in the test test_read_reference_compound

def test_read_reference_compound(self):
        self.test_write_reference_compound()
        self.read()
        builder = self.createReferenceCompoundBuilder()['ref_dataset']
        read_builder = self.root['ref_dataset']

If I try to run read_builder['data'][0] I will get an error that *** TypeError: 'tuple' object does not support item assignment. Digging deeper it is because our get_item swaps/will reassign values in the tuple. This is not allowed because it is a tuple. Now I can convert that to a list on _get_item_ and that does work. However, it is a little hacky. It also doesn't cover when my index is ":" vs a single integer.

How to test the behavior?

Show how to reproduce the new behavior (can be a bug fix or a new feature)

Checklist

[ ] Did you update CHANGELOG.md with your changes?
[ ] Have you checked our Contributing document?
[ ] Have you ensured the PR clearly describes the problem and the solution?
[ ] Is your contribution compliant with our coding style? This can be checked running ruff from the source directory.
[ ] Have you checked to ensure that there aren't other open Pull Requests for the same change?
[ ] Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

mavaylon1 commented 8 months ago

@oruebel As I was going through the tests, I saw that the way I had to interact with the read data differed from before. Diving deeper it comes down to what I said above where we are now using tuples, which cannot change. The fix (even if it is a little hacky) is to convert to a list when trying to use index notation to "get". I wanted your thoughts on the matter and I also think this may need to go all the way up the parent hierarchy to HDMFDataset.

oruebel commented 8 months ago

Digging deeper it is because our get_item swaps/will reassign values in the tuple.

Could you please point me to the lines of code where the tuple is being modified on read so I can take a look to better understand what is going. I'm not sure right now why the tuples are being modified on read (maybe to resolve the references).

mavaylon1 commented 8 months ago

Digging deeper it is because our get_item swaps/will reassign values in the tuple.

Could you please point me to the lines of code where the tuple is being modified on read so I can take a look to better understand what is going. I'm not sure right now why the tuples are being modified on read (maybe to resolve the references).

Lines

sneakers-the-rat commented 8 months ago

I was just going to work on https://github.com/hdmf-dev/hdmf-zarr/pull/146 and saw this parallel PR, should I work on this instead or what

sneakers-the-rat commented 8 months ago

Wrote this here: https://github.com/hdmf-dev/hdmf-zarr/pull/146#issuecomment-1855049073

but that tuple problem is just from not setting the dtype in the zarr dataset, so it ignores the dtype you give to np.array and just stores as tuples. If you correctly set that then it behaves just fine.

So do this: https://github.com/hdmf-dev/hdmf-zarr/blob/8100341512324af25718ab93dc7fc54b68f03469/src/hdmf_zarr/backend.py#L1044

instead of this: https://github.com/hdmf-dev/hdmf-zarr/blob/005bfbc0eb3ecee884170ec2a93b7d20c63c9655/src/hdmf_zarr/backend.py#L1016

mavaylon1 commented 8 months ago

I was just going to work on #146 and saw this parallel PR, should I work on this instead or what

Yes please

mavaylon1 commented 8 months ago

read_builder['data'][0]

Interesting. Could you explain more?

mavaylon1 commented 8 months ago

Wrote this here: #146 (comment)

but that tuple problem is just from not setting the dtype in the zarr dataset, so it ignores the dtype you give to np.array and just stores as tuples. If you correctly set that then it behaves just fine.

So do this:

https://github.com/hdmf-dev/hdmf-zarr/blob/8100341512324af25718ab93dc7fc54b68f03469/src/hdmf_zarr/backend.py#L1044

instead of this:

https://github.com/hdmf-dev/hdmf-zarr/blob/005bfbc0eb3ecee884170ec2a93b7d20c63c9655/src/hdmf_zarr/backend.py#L1016

Oh it stores it as <class 'numpy.void'> when you provide a dtype instead of a tuple.

sneakers-the-rat commented 8 months ago

Yes please

Alright, lmk how - i don't have commit permissions on this repo so can't modify this branch. Usually bad etiquette to take the several hours of work i've done here and just rewrite it without credit, but whatevs.

Oh it stores it as <class 'numpy.void'> when you provide a dtype instead of a tuple.

right, and that's the desired behavior for storing a structured array, which is what I thought the goal was. As this PR is written, the constructed dtype is just ignored and the array is just stored as serialized tuples.

mavaylon1 commented 8 months ago

Yes please

Alright, lmk how - i don't have commit permissions on this repo so can't modify this branch. Usually bad etiquette to take the several hours of work i've done here and just rewrite it without credit, but whatevs.

Oh it stores it as <class 'numpy.void'> when you provide a dtype instead of a tuple.

right, and that's the desired behavior for storing a structured array, which is what I thought the goal was. As this PR is written, the constructed dtype is just ignored and the array is just stored as serialized tuples.

I believe we were working on this in parallel. We can just keep it to your branch on your fork to give you credit. The intent was for me to dig around from the last checkpoint I saw on yours. Let me know when you're done and I'll review it. I'll continue to play around here for educational purposes.

sneakers-the-rat commented 8 months ago

ya sorry i don't mean to be prickly but this has happened a number of times to me lol and usually i don't care, but when last looking for a job it actually was important to be able to demonstrate contribs. I think the patch is done except for tests and docs?

mavaylon1 commented 8 months ago

ya sorry i don't mean to be prickly but this has happened a number of times to me lol and usually i don't care, but when last looking for a job it actually was important to be able to demonstrate contribs. I think the patch is done except for tests and docs?

I can check the tests, but I will be out of town starting tomorrow. I'll go through it will most likely finalize this for a merge by Tuesday.

sneakers-the-rat commented 8 months ago

lmk what else needs to be written re: docs and tests, happy to do that! i try to be as labor-neutral as i can be when raising issues/PRs :). take ur time, no rush. also sorry for my tone earlier, frustrating day.

rly commented 5 months ago

@mavaylon1 can this be closed, now that #146 has been merged?

hdmf-dev / hdmf-zarr