hdmf-dev / hdmf-zarr

Zarr I/O backend for HDMF
https://hdmf-zarr.readthedocs.io/

Write references in compound datasets as an array #146

Closed sneakers-the-rat closed 7 months ago

sneakers-the-rat commented 9 months ago

Motivation

Fix https://github.com/hdmf-dev/hdmf-zarr/issues/144

Writes references in compound datasets as a single same-sized array, rather than iteratively as an array of lists.
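Roughly, the idea (a minimal sketch with placeholder values, assuming zarr v2's object_codec argument, not the backend's actual code) is to build the whole compound array in memory and hand it to zarr in a single write:

```python
import numpy as np
import zarr
import numcodecs

# Placeholder dtype and reference values, for illustration only.
dtype = np.dtype([('idx_start', '<i4'), ('count', '<i4'), ('timeseries', 'O')])
rows = [(0, 10, {'path': '/acquisition/ts1'}),
        (10, 5, {'path': '/acquisition/ts2'})]

# Build the full structured array once, then write it in one call
# (a pickle-based object codec is one way to let zarr hold the reference
# dicts; this is an assumption, not necessarily hdmf-zarr's choice).
arr = np.array(rows, dtype=dtype)
z = zarr.array(arr, object_codec=numcodecs.Pickle())
```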

How to test the behavior?

Convert a dataset with compound datasets, confirm that it doesn't take a really long time.

Making a more detailed example is blocked for me by https://github.com/hdmf-dev/hdmf-zarr/issues/145 , and I'm not sure whether I should keep working on this since it seems like the devs are doing their own version here (?): https://github.com/hdmf-dev/hdmf-zarr/pull/149

Checklist

very tired and so leaving this as a draft for now

sneakers-the-rat commented 8 months ago

Implemented writing as a compound dset. I used the information from the options passed to the function rather than info from the builder class, to reduce coupling, but if y'all prefer the builder-based approach it's a simple drop-in fix.

So just after the dset is written...

>>> (Pdb) dset[0]
(3798, 1, {'source': '.', 'path': '/processing/stimulus/timestamps', 'object_id': 'ff28f35d-9211-4bdb-b026-ad27d46367a3', 'source_object_id': 'a695c6dd-5c46-428d-a927-5c2027c4b653'})
>>> (Pdb) dset.dtype
dtype([('idx_start', '<i4'), ('count', '<i4'), ('timeseries', 'O')])

This version also doesn't have the same problem as https://github.com/hdmf-dev/hdmf-zarr/pull/149 , since the dataset itself is given the dtype, the returned type is a numpy void type rather than a tuple:

>>> (Pdb) type(dset[0])
<class 'numpy.void'>
>>> (Pdb) dset[0][1] = 5
>>> (Pdb) # (it works)

whereas if the zarr dataset is created with dtype=object...

>>> (Pdb) type(dset[0])
<class 'tuple'>
>>> (Pdb) dset[0][1] = 5
*** TypeError: 'tuple' object does not support item assignment
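For reference, the same distinction shows up in plain numpy, independent of zarr (a small sketch, not hdmf-zarr code):

```python
import numpy as np

# Structured (compound) dtype: rows come back as numpy.void, which supports item assignment.
compound = np.zeros(3, dtype=[('idx_start', '<i4'), ('count', '<i4'), ('timeseries', 'O')])
print(type(compound[0]))   # <class 'numpy.void'>
compound[0][1] = 5         # works, and writes through to the parent array

# Object-dtype array holding tuples: rows come back as plain tuples, which are immutable.
as_tuples = np.empty(3, dtype=object)
for i in range(3):
    as_tuples[i] = (0, 0, None)
print(type(as_tuples[0]))  # <class 'tuple'>
# as_tuples[0][1] = 5      # TypeError: 'tuple' object does not support item assignment
```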

all tests pass. not sure what tests i would need to add here for this specific behavior but lmk

codecov-commenter commented 8 months ago

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (f5f09c8) 85.16% compared to head (b33301c) 85.59%.

| Files | Patch % | Lines |
| --- | --- | --- |
| src/hdmf_zarr/backend.py | 87.50% | 1 Missing and 1 partial :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff            @@
##              dev     #146     +/-   ##
==========================================
+ Coverage   85.16%   85.59%   +0.42%
==========================================
  Files           5        5
  Lines        1146     1159      +13
  Branches      291      295       +4
==========================================
+ Hits          976      992      +16
+ Misses        114      110       -4
- Partials       56       57       +1
```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

mavaylon1 commented 8 months ago

@sneakers-the-rat I'll take a look at the code coverage tonight, but from a quick glance it is passing.

mavaylon1 commented 8 months ago

@sneakers-the-rat Looks good to me. I can update the changelog for you if you'd like. After that, I'll do one last review and approve it.

sneakers-the-rat commented 8 months ago

@mavaylon1 why don't you handle the changelog, since there's probably a way y'all like to write it/organize it with whatever other PRs have been going on, etc. :)

lemme try to write a test, even if it's just as simple as testing that the write ends up in the right format (idk seems like a test for how long the operation took would be pretty flaky)

mavaylon1 commented 8 months ago

> @mavaylon1 why don't you handle the changelog, since there's probably a way y'all like to write it/organize it with whatever other PRs have been going on, etc. :)
>
> lemme try to write a test, even if it's just as simple as testing that the write ends up in the right format (idk seems like a test for how long the operation took would be pretty flaky)

Looking at the coverage, no further test is needed.

sneakers-the-rat commented 8 months ago

Added a quick test that the array was written as a compound array - lmk if that's not the right way to test it. The values are hardcoded to match the test dataset, but that seemed like the best way to differentiate compound vs. tuple behavior given how the rest of the test is written. It also tests both of the special cases for storage of objects and strings, with O and U25.

edit: omg, reading the rest of the tests i think i wrote this in the wrong place, one sec. edit2: (moved to comment below)

mavaylon1 commented 8 months ago

> Added a quick test that the array was written as a compound array - lmk if that's not the right way to test it. The values are hardcoded to match the test dataset, but that seemed like the best way to differentiate compound vs. tuple behavior given how the rest of the test is written. It also tests both of the special cases for storage of objects and strings, with O and U25.
>
> edit: omg, reading the rest of the tests i think i wrote this in the wrong place, one sec

So this should already be tested by existing code; that's why coverage was passing before.

sneakers-the-rat commented 8 months ago

moving this out of the edit so it stays in sequence: ok, i don't see a better place for this, but can revert if not wanted. Coverage should be the same since all the branches of the changed code are reached by the existing tests, but i don't see an explicit test for the dtype of the stored zarr array, so figured i would at least offer a test of the thing that i changed :)

rly commented 8 months ago

@sneakers-the-rat Thanks for the updates and test!

> Looking at the coverage, no further test is needed.

@mavaylon1 Code coverage measures whether every line of the codebase has been run by the test suite. It doesn't reflect whether the codebase does what it is supposed to do and whether the new code correctly fixes the issue. In general, when an issue is found, a test (i.e., an assert) should be added that reproduces the issue and fails in the current main branch. The new code in the PR should make that test go from failing to passing. So here, even though all of the new code was touched by the test suite, a test should still be added to make sure the issue has actually been fixed and also make sure that any future changes to the codebase will not cause the same issue to arise.

sneakers-the-rat commented 8 months ago

In this case the test also revealed a failure on windows, which is fun. Lemme debug

mavaylon1 commented 8 months ago

> @sneakers-the-rat Thanks for the updates and test!
>
> > Looking at the coverage, no further test is needed.
>
> @mavaylon1 Code coverage measures whether every line of the codebase has been run by the test suite. It doesn't reflect whether the codebase does what it is supposed to do and whether the new code correctly fixes the issue. In general, when an issue is found, a test (i.e., an assert) should be added that reproduces the issue and fails in the current main branch. The new code in the PR should make that test go from failing to passing. So here, even though all of the new code was touched by the test suite, a test should still be added to make sure the issue has actually been fixed and also make sure that any future changes to the codebase will not cause the same issue to arise.

@rly I'm aware. My point is that there should already be a test that checks compound data is written as compound data. If not, then I understand adding one. The changes here, in practice, shouldn't break any tests because the results should be the same. Checking that the data was written correctly as compound data should be the same as the tests that already check compound data (again, these should already exist). From my understanding, the issue isn't a bug but an optimization problem. The "fix" should already be covered by existing tests.

sneakers-the-rat commented 8 months ago

From what i could tell, the checks that exist do pass, but they don't check explicitly for the storage format. e.g. the adjoining tests https://github.com/sneakers-the-rat/hdmf-zarr/blob/63922aea2339f64b96803394217498b833afc709/tests/unit/base_tests_zarrio.py#L444 that do self.assertEqual(v[0], builder['data'][i][0]) would pass both for compound dtypes and for object-dtype arrays that store a 1D array of tuples, since the indexing is the same in either case.
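A tiny numpy illustration of that point (not the actual test code):

```python
import numpy as np

# Element indexing looks identical for a structured array and for an object array of
# tuples, so per-element equality checks pass either way; only the dtype tells them apart.
compound = np.array([(1, 2)], dtype=[('a', '<i8'), ('b', '<i8')])
as_tuples = np.empty(1, dtype=object)
as_tuples[0] = (1, 2)

assert compound[0][0] == as_tuples[0][0] == 1   # both index the same way
assert compound.dtype != as_tuples.dtype        # only the dtype differs
```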

obviously y'all have a better sense of the tests that exist than I do, so if there already is one then great! I don't really care about the inclusion of the one crappy lil test I added (or even really the storage format), my main goal here is to just help out making writing happen at a reasonable speed, so lmk what's good <3

edit: sorry since there is a failing test - i don't have a windows machine to debug, but i can just spam warnings as print statements in the GH actions until it works lol. if y'all have windows machines and debug before i can get to it, then u win the debugging race!

mavaylon1 commented 8 months ago

> From what i could tell, the checks that exist do pass, but they don't check explicitly for the storage format. e.g. the adjoining tests https://github.com/sneakers-the-rat/hdmf-zarr/blob/63922aea2339f64b96803394217498b833afc709/tests/unit/base_tests_zarrio.py#L444 that do self.assertEqual(v[0], builder['data'][i][0]) would pass both for compound dtypes and for object-dtype arrays that store a 1D array of tuples, since the indexing is the same in either case.
>
> obviously y'all have a better sense of the tests that exist than I do, so if there already is one then great! I don't really care about the inclusion of the one crappy lil test I added (or even really the storage format), my main goal here is to just help out making writing happen at a reasonable speed, so lmk what's good <3
>
> edit: sorry since there is a failing test - i don't have a windows machine to debug, but i can just spam warnings as print statements in the GH actions until it works lol. if y'all have windows machines and debug before i can get to it, then u win the debugging race!

I'll look after the break. By all means, if there's a gap in the tests we should fill it. My point was based on the assumption that there are tests that already do what we want, because this isn't a format change, just an order of operations. But if this process exposed a gap, then let's fill it.

mavaylon1 commented 8 months ago

@sneakers-the-rat I updated the test to our usual use of TestCase's assertEqual. That gave more insight into the issue on Windows:

AssertionError: dtype([('id', '<i4'), ('name', '<U25'), ('reference', 'O')]) != dtype([('id', '<i8'), ('name', '<U25'), ('reference', 'O')])

The id types are not matching.

sneakers-the-rat commented 8 months ago

ope. so windows treats np.dtype('int') as 32-bit by default i guess, good to know.
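For illustration only (an assumed sketch, not the project's type map), pinning explicit widths sidesteps the platform-dependent default:

```python
import numpy as np

# On Windows (NumPy 1.x), np.dtype('int') maps to the 32-bit C long, so a compound dtype
# built from 'int' comes out as '<i4' there but '<i8' on Linux/macOS. Pinning an explicit
# width keeps the stored layout identical everywhere.
compound_dtype = np.dtype([('id', '<i8'), ('name', '<U25'), ('reference', 'O')])
assert compound_dtype['id'].itemsize == 8  # same on every platform
```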

Looking to see if there was an explicit type map: there is, and I missed it. That also shows that my check for str was too strict, since presumably all strings should be saved as 'U25' (I'm just basing that off what you specified here https://github.com/hdmf-dev/hdmf-zarr/blob/005bfbc0eb3ecee884170ec2a93b7d20c63c9655/src/hdmf_zarr/backend.py#L1032 , which seems like it could truncate strings to me, but that part ofc is up to y'all). Replaced it with a check for all types that are mapped to str in __dtypes.

It seems like there should be a single string-dtype -> numpy-dtype mapping method, but the __resolve_dtype_helper__ method has a problem: zarr doesn't support variable-length unicode strings ( https://zarr.readthedocs.io/en/stable/tutorial.html#string-arrays ), so its str -> '<U' mapping is silently dropped by zarr, and on read it returns empty strings. I kept most of the existing type-creation code, but added a call to __resolve_dtype_helper__ in the else case, and added a failure mode for an unhandled else in that method, since the existing else implicitly assumed the dtype was a list.
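The shape of that guard, as a simplified sketch (the SCALAR_MAP and the spec format here are assumptions, not the real __resolve_dtype_helper__):

```python
import numpy as np

# Assumed scalar map and compound-spec format, for illustration only.
SCALAR_MAP = {'int': np.dtype('<i8'), 'float': np.dtype('<f8'), 'text': np.dtype('O')}

def resolve_dtype(spec):
    if isinstance(spec, str):
        return SCALAR_MAP[spec]
    if isinstance(spec, list):  # compound spec: list of {'name': ..., 'dtype': ...}
        return np.dtype([(field['name'], resolve_dtype(field['dtype'])) for field in spec])
    # fail loudly instead of implicitly assuming the spec is a list
    raise ValueError(f"Unhandled dtype specification: {spec!r}")
```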

Feels like the dtype conversion logic is getting pretty dispersed at this point; it would be nice to consolidate it into one or a few maps (seems like it would be good to have separate ones for the intended python type vs. the storage type, e.g. str vs '<U25'), but i won't turn this PR into a big refactor <3


This might reveal another issue, but i'm not sure - just prior to the changes I made, a type_str object is made that makes use of the type map: https://github.com/hdmf-dev/hdmf-zarr/blob/9692a98f0568e0f968c89411a67bd4f483c91d9d/src/hdmf_zarr/backend.py#L1010

That gets saved as a dataset attribute: https://github.com/hdmf-dev/hdmf-zarr/blob/9692a98f0568e0f968c89411a67bd4f483c91d9d/src/hdmf_zarr/backend.py#L1020

So if that is supposed to be used by the reader to re-hydrate the object and cast to correct types, it isn't being used. But i'm not sure that's what that is for, so idk.

mavaylon1 commented 8 months ago

> ope. so windows treats np.dtype('int') as 32-bit by default i guess, good to know.
>
> Looking to see if there was an explicit type map: there is, and I missed it. That also shows that my check for str was too strict, since presumably all strings should be saved as 'U25' (I'm just basing that off what you specified here https://github.com/hdmf-dev/hdmf-zarr/blob/005bfbc0eb3ecee884170ec2a93b7d20c63c9655/src/hdmf_zarr/backend.py#L1032 , which seems like it could truncate strings to me, but that part ofc is up to y'all). Replaced it with a check for all types that are mapped to str in __dtypes.
>
> It seems like there should be a single string-dtype -> numpy-dtype mapping method, but the __resolve_dtype_helper__ method has a problem: zarr doesn't support variable-length unicode strings ( https://zarr.readthedocs.io/en/stable/tutorial.html#string-arrays ), so its str -> '<U' mapping is silently dropped by zarr, and on read it returns empty strings. I kept most of the existing type-creation code, but added a call to __resolve_dtype_helper__ in the else case, and added a failure mode for an unhandled else in that method, since the existing else implicitly assumed the dtype was a list.
>
> Feels like the dtype conversion logic is getting pretty dispersed at this point; it would be nice to consolidate it into one or a few maps (seems like it would be good to have separate ones for the intended python type vs. the storage type, e.g. str vs '<U25'), but i won't turn this PR into a big refactor <3
>
> This might reveal another issue, but i'm not sure - just prior to the changes I made, a type_str object is made that makes use of the type map: https://github.com/hdmf-dev/hdmf-zarr/blob/9692a98f0568e0f968c89411a67bd4f483c91d9d/src/hdmf_zarr/backend.py#L1010
>
> That gets saved as a dataset attribute: https://github.com/hdmf-dev/hdmf-zarr/blob/9692a98f0568e0f968c89411a67bd4f483c91d9d/src/hdmf_zarr/backend.py#L1020
>
> So if that is supposed to be used by the reader to re-hydrate the object and cast to correct types, it isn't being used. But i'm not sure that's what that is for, so idk.

The strings don't need to be length 25. I also found that using <U would return empty strings downstream. The U25 was me testing to see whether providing a cap would fix that, and it does. That's an open-ended discussion that I thought was settled with your implementation, but I think there's something lingering that has yet to be solidified. I'm going to dive into this Friday and see what I can come up with.

sneakers-the-rat commented 8 months ago

Basically all that needs to happen is to decide on either a single length, or calculate the max needed length from the data, ideally in a single mapping method that can be hooked into from the various places one needs to get a zarr storage dtype from a source dtype. Hopefully not too complicated :)
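A minimal sketch of what such a mapping could look like (the helper name and the size-from-data choice are assumptions, not the project's API):

```python
import numpy as np

def zarr_string_dtype(values):
    """Hypothetical helper: pick a fixed-width unicode dtype wide enough for the data.

    Zarr silently drops variable-length '<U' (strings read back empty), so a concrete
    width has to be chosen; here it is sized to the longest string present.
    """
    max_len = max((len(v) for v in values if isinstance(v, str)), default=1)
    return np.dtype(f'<U{max_len}')

# e.g. zarr_string_dtype(['stimulus', 'timestamps']) -> dtype('<U10')
```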

mavaylon1 commented 8 months ago

Hey @sneakers-the-rat let me know if you want to finish this up or if you want me to.

sneakers-the-rat commented 8 months ago

Ah I was not sure what approach we were taking, I can do some tests today to finish it out

sneakers-the-rat commented 8 months ago

Alright, I didn't do seriously careful profiling here, but swapping the 'U25' dtype for 'O' is a perf wash, and when read back the values seem to be seamlessly recast to strings.

On a single run (not robust) of the relevant test cases (cProfile's cumulative time):

| Case | U25 | Object |
| --- | --- | --- |
| test_read_reference_compound_buf | 0.0375 | 0.0455 |
| test_read_reference_compound | 0.0356 | 0.0325 |
| test_write_reference_compound | 0.0381 | 0.0388 |
| total | 0.0911 | 0.1 |
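For context, one way such per-case cumulative times can be collected (a sketch, not necessarily how it was done here):

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
# ... run one write/read round-trip of the compound-reference dataset here ...
profiler.disable()

# report cumulative time per call, as in the table above
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```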

But in a "naturalistic" case, where I just timed the whole write of the dataset I had referenced in the issue, the object encoding was oddly faster for the write_dataset method.

So pretty much the same. I won't add a test case for very long strings, since if I did it would just be a hackjob attached to the presently modified test methods, and that seems better suited for a test fixture/parameterization.

sneakers-the-rat commented 8 months ago

Sorry for the force push, just to amend an incorrect commit message

sneakers-the-rat commented 7 months ago

I think this is good to go at this point :) lmk if changes still needed