hdmf-dev / hdmf-zarr

Zarr I/O backend for HDMF
https://hdmf-zarr.readthedocs.io/

default object_codec_class -> JSON #173

Closed · magland closed this 3 weeks ago

magland commented 7 months ago

Motivation

See #171

I was working on adding nwb-zarr support to neurosift and I ran into a problem where many of the datasets within the zarr were pickle-encoded. It is difficult to read pickled data in languages other than Python, and it is practically impossible using JavaScript in the browser. So I would like to request that hdmf-zarr be updated to avoid using pickling when writing datasets.

In this PR I adjusted the default object_codec_class to numcodecs.JSON.

In addition, I also exposed this parameter in the constructor of NWBZarrIO (see the sketch below).
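
For illustration, usage would look something like this (a minimal sketch; the keyword name object_codec_class is taken from the existing ZarrIO parameter):

```python
from hdmf_zarr.nwb import NWBZarrIO
import numcodecs

# Pass the object codec explicitly via the newly exposed constructor parameter
# (keyword name assumed to match ZarrIO's existing object_codec_class).
io = NWBZarrIO(path="test.zarr", mode="w", object_codec_class=numcodecs.JSON)
```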

Fix #171

How to test the behavior?

I ran the following with my forked version, and verified that the resulting zarr opens properly in Neurosift:

from pynwb import NWBHDF5IO
from hdmf_zarr.nwb import NWBZarrIO

filename = 'sub-anm00239123_ses-20170627T093549_ecephys+ogen.nwb'

def main():
    zarr_filename = 'test.zarr'
    with NWBHDF5IO(path=filename, mode='r', load_namespaces=False) as read_io:  # Create HDF5 IO object for read
        with NWBZarrIO(path=zarr_filename, mode='w') as export_io:         # Create Zarr IO object for write
            export_io.export(src_io=read_io, write_args=dict(link_data=False))   # Export from HDF5 to Zarr

if __name__ == "__main__":
    main()

Checklist

Note: I wasn't sure how to update CHANGELOG.md -- I mean, where to put the new text

oruebel commented 7 months ago

In this PR I adjusted the default object_codec_class to numcodecs.JSON.

I think the main reason I didn't use JSON at the time was that there are cases where objects are not JSON serializable. However, I don't remember whether that was an actual issue I encountered or whether I was just being overly cautious and used Pickle because I knew it would work. But I think if the test suite passes, we should be OK.

magland commented 7 months ago

Could you also please add a line to the CHANGELOG https://github.com/hdmf-dev/hdmf-zarr/blob/dev/CHANGELOG.md

Do you want me to add it under the 0.6.0 heading, or make a new heading? (0.6.1, 0.7.0)

oruebel commented 7 months ago

Do you want me to add it under the 0.6.0 heading, or make a new heading? (0.6.1, 0.7.0)

Please add it under a new heading "0.7.0 (upcoming)" since 0.6.0 has already been released

magland commented 7 months ago

Do you want me to add it under the 0.6.0 heading, or make a new heading? (0.6.1, 0.7.0)

Please add it under a new heading "0.7.0 (upcoming)" since 0.6.0 has already been released

Done.

magland commented 7 months ago

Now that I made the actual modification, some tests are failing on my fork. So I guess the process is going to be more involved.

oruebel commented 7 months ago

Now that I made the actual modification, some tests are failing on my fork. So I guess the process is going to be more involved.

It looks like there are two main types of errors related to the JSON encoding:

  1. TypeError: Object of type void is not JSON serializable --> This appears in 5 failing tests, all of which seem to be related to writing ZarrWriteUnit, i.e., I think these are likely all the same issue. A minimal reproduction is sketched after this list.
  2. AttributeError: 'int' object has no attribute 'append' --> This appears in 18 failing tests, and as far as I can tell they are all related to writing compound datasets with references; more specifically, I believe these are writing a compound data type that is a mix of strings and ints, similar to TimeSeriesReferenceVectorData in NWB.
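
A minimal reproduction of the first error, assuming it originates from handing a row of a structured array (a numpy.void scalar) to the standard library's JSON encoder:

```python
import json
import numpy as np

# A single row of a structured (compound) array is a numpy.void scalar,
# which the stdlib JSON encoder rejects with exactly the error seen in CI.
row = np.array([(1, 2.0)], dtype=[("a", "i4"), ("b", "f8")])[0]
print(type(row))  # <class 'numpy.void'>
json.dumps(row)   # TypeError: Object of type void is not JSON serializable
```
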
oruebel commented 7 months ago

2. AttributeError: 'int' object has no attribute 'append'

In all cases, the point of failure for this error appears to be in the write_dataset function at:

https://github.com/hdmf-dev/hdmf-zarr/blob/556ed123e5cc07a434fd13593c06dca861267b8f/src/hdmf_zarr/backend.py#L1055-L1061

https://github.com/hdmf-dev/hdmf-zarr/pull/146 updated the writing of compound datasets to fix the determination of the dtype and may be relevant here.

magland commented 7 months ago

Thanks, I'm working on a potential solution.

magland commented 7 months ago

@oruebel I have pushed a solution that falls back to a Pickle codec in certain (hopefully rare) cases and prints a warning.

The cases are: (a) structured arrays and (b) compound dtypes with embedded references.

For the purposes of neurosift, I am okay with this, because it seems to be a rare occurrence. But I wasn't sure whether you wanted to either suppress the warnings or try to serialize this in a different way.
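
Roughly, the fallback amounts to something like the following (a hypothetical sketch, not the PR's actual code; the helper name and the compound-dtype test are illustrative):

```python
import warnings
import numcodecs

def choose_object_codec(dtype):  # hypothetical helper, not in hdmf-zarr
    """Default to JSON; fall back to Pickle for the two problem cases."""
    is_compound = isinstance(dtype, (list, tuple))  # structured/compound dtype
    if is_compound:
        warnings.warn(
            "Falling back to numcodecs.Pickle for a structured array or a "
            "compound dtype with embedded references; this data will be "
            "hard to read outside Python."
        )
        return numcodecs.Pickle
    return numcodecs.JSON
```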

oruebel commented 7 months ago

Thanks a lot! I think it would be great to see if we can use JSON for those cases too. TimeSeriesReferenceVectorData is probably the main case in NWB. This is a compound of two ints and a reference. References are stored as a JSON string, so I think those should be serializable with JSON too.

@mavaylon1 could you take a look at this PR to see if we can use JSON serialization for the compound data case that @magland mentioned as well? It would be good to be consistent in how we serialize objects. I think this is related to the recent PR on compound data types.

magland commented 7 months ago

I found this particularly challenging example in the wild. It seems it has a dataset that is a list of lists with embedded references.

000713 - Allen Institute - Visual Behavior - Neuropixels

https://neurosift.app/?p=/nwb&url=https://api.dandiarchive.org/api/assets/b2391922-c9a6-43f9-8b92-043be4015e56/download/&dandisetId=000713&dandisetVersion=draft

https://api.dandiarchive.org/api/assets/b2391922-c9a6-43f9-8b92-043be4015e56/download/

intervals/grating_presentations/timeseries

[[3, 135, <HDF5 object reference>], [138, 348, <HDF5 object reference>], [486, 392, <HDF5 object reference>], [878, 474, <HDF5 object reference>], [1352, 373, <HDF5 object reference>], ..., [53330, 430, <HDF5 object reference>], [53760, 210, <HDF5 object reference>]]

oruebel commented 7 months ago

I found this particularly challenging example in the wild. It seems it has a dataset that is a list of lists with embedded references.

Yes, that is a TimeSeriesReferenceVectorData https://nwb-schema.readthedocs.io/en/stable/format.html#timeseriesreferencevectordata, which is a compound dataset whose rows hold two ints (the start index and count) followed by a reference to a TimeSeries.

oruebel commented 7 months ago

I found this particularly challenging example in the wild.

Note, in Zarr the object reference is represented as a JSON string, so the dtype for TimeSeriesReferenceVectorData ends up being (int, int, str). hdmf-zarr then deals with translating the JSON string back to resolve the reference; this is done lazily on read. I.e., from a Zarr perspective, I think this datatype of (int, int, str) should in principle still be serializable in JSON.
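
For illustration (the key names inside the reference JSON are assumptions, not hdmf-zarr's actual layout), a stored row might look like:

```python
import json

# Hypothetical shape of one TimeSeriesReferenceVectorData row in Zarr:
# (idx_start, count, reference-as-JSON-string), i.e., (int, int, str).
reference = json.dumps({"source": ".", "path": "/stimulus/presentation/grating"})
row = (3, 135, reference)
```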

codecov-commenter commented 7 months ago

Codecov Report

Attention: Patch coverage is 89.47368%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 86.07%. Comparing base (556ed12) to head (29adcca).

| Files | Patch % | Lines |
| --- | --- | --- |
| src/hdmf_zarr/backend.py | 88.88% | 0 Missing and 2 partials :warning: |

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##              dev     #173      +/-   ##
==========================================
+ Coverage   86.04%   86.07%   +0.03%
==========================================
  Files           5        5
  Lines        1161     1178      +17
  Branches      296      303       +7
==========================================
+ Hits          999     1014      +15
  Misses        107      107
- Partials       55       57       +2
```

:umbrella: View full report in Codecov by Sentry.

sneakers-the-rat commented 6 months ago

Can we turn the codec class into a getter method so we can encapsulate all switching logic in one place?

currently there's: https://github.com/hdmf-dev/hdmf-zarr/blob/29adcca1af14bb03255fdf52723daedf6c67ce44/src/hdmf_zarr/backend.py#L154

the logic of when to use which codec is split in a few places, but it seems to ultimately depend on builder.dtype:

this check here: https://github.com/hdmf-dev/hdmf-zarr/blob/29adcca1af14bb03255fdf52723daedf6c67ce44/src/hdmf_zarr/backend.py#L1054 has its roots here: https://github.com/hdmf-dev/hdmf-zarr/blob/29adcca1af14bb03255fdf52723daedf6c67ce44/src/hdmf_zarr/backend.py#L955

and similarly this check: https://github.com/hdmf-dev/hdmf-zarr/blob/29adcca1af14bb03255fdf52723daedf6c67ce44/src/hdmf_zarr/backend.py#L1279 has its roots here: https://github.com/hdmf-dev/hdmf-zarr/blob/29adcca1af14bb03255fdf52723daedf6c67ce44/src/hdmf_zarr/backend.py#L1218 called from here: https://github.com/hdmf-dev/hdmf-zarr/blob/29adcca1af14bb03255fdf52723daedf6c67ce44/src/hdmf_zarr/backend.py#L1071

which is just the other leg of the above check, and L1276-1291 is very similar to L1037-1050.

write_builder and the other write_* methods are almost private methods, since the docs only demonstrate using write(). The builder is passed through them rather than stored as an instance attr (presumably since it gets remade on each write()? not really sure), so we can't assume we have it and use the above object_codec_class property. But it looks like every time we get __codec_cls we also have the builder, or the options dict that gets constructed from it, in scope, so couldn't we do something like get_codec_cls(dtype)?
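
i.e., roughly (names assumed; this is essentially the fallback sketched earlier, but owned by the backend so the switching logic lives in exactly one place):

```python
import numcodecs

def get_codec_cls(dtype):
    """Hypothetical single switch point for object codecs."""
    needs_pickle = isinstance(dtype, (list, tuple))  # compound/structured dtype
    return numcodecs.Pickle if needs_pickle else numcodecs.JSON
```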

not my package, but when I was working on this before I had left this note: https://github.com/hdmf-dev/hdmf-zarr/blob/29adcca1af14bb03255fdf52723daedf6c67ce44/src/hdmf_zarr/backend.py#L1033-L1036, which seems to be out of date because we're using objects rather than converting to strings (presumably the root of the issue here). So it also seems like we can consolidate the several places where we create and check a dtype into that __resolve_dtype_helper__ method, which could then make get_codec_cls() well defined as well.

oruebel commented 6 months ago

@sneakers-the-rat some more details below, but to make a long story short, my hope is that we can:

  1. Avoid having to use a mix of different codecs in the same file, and as such avoid the need for logic to switch between codecs, if possible
  2. If we really need to use multiple codecs, then:
    • Yes, we should encapsulate the logic for selecting codecs in a function, as you proposed
    • We should remove the object_codec_class parameter on the ZarrIO.__init__ method and not allow users to manually specify the codec to use (this is probably a good idea in general, to avoid having to deal with too many corner cases)

@magland @sneakers-the-rat sorry that I have not had the chance to dive into the code for this PR yet, but those are things I hope we can do before merging this PR, i.e., try to do 1. if possible and implement 2. if we can't make 1. work.

the logic of when to use which codec is split in a few places

TBH, my hope is that we can avoid having to use different codecs if possible. The use of numcodecs.Pickle was convenient because it is pretty reliable in handling even complex cases; however, it makes it hard for non-Python code (e.g., JavaScript in the case of neurosift) to use the Zarr files directly. numcodecs.JSON makes that easier, but JSON is not as flexible as Pickle in terms of the cases it handles. The main issue seems to be structured arrays. However, in our case those are mostly a combination of ints and strings, so my hope is that we can handle those in JSON as well and only fall back to Pickle when absolutely necessary.

Can we turn the codec class into a getter method so we can encapsulate all switching logic in one place?

Currently the backend uses a single codec throughout, so a getter was not needed, but if we really need to use multiple codecs (which, again, I would ideally like to avoid if possible), then I totally agree that encapsulating this logic in a method is a good idea. We would then probably also need to get rid of object_codec_class as a parameter on __init__, since we need to control when to use which codec: https://github.com/hdmf-dev/hdmf-zarr/blob/897d3b95ddfadd9364b2e2bc019e6ade86e920ce/src/hdmf_zarr/backend.py#L95-L97

so couldn't we do something like get_codec_cls(dtype)

Yes, I believe that should work. The object_codec_class property is mainly there since the codec is currently configurable on __init__.

mavaylon1 commented 6 months ago

@oruebel I'll dive in this week.

mavaylon1 commented 6 months ago

Recap and Updates: We ideally want only one codec. I've been trying to get the compound case with references to work, but I've run into some issues. Using the dev branch of numcodecs, the problem is decoding the data. I opened an issue on the numcodecs GitHub and will try to find a fix: https://github.com/zarr-developers/numcodecs/issues/518
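
One plausible source of the decode trouble (an assumption on my part, not a claim about what the linked issue concluded): numpy's short dtype string, which is what survives a JSON round trip, drops the field information of a compound dtype:

```python
import numpy as np

# A compound dtype similar to TimeSeriesReferenceVectorData's on-disk form.
dtype = np.dtype([("idx_start", "i4"), ("count", "i4"), ("target", "O")])
print(dtype.str)            # '|V16' -- the field names and types are gone
print(np.dtype(dtype.str))  # a plain void dtype, not the compound one
```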

mavaylon1 commented 5 months ago

I have an "almost" proof of concept for this. I will share it this week if it holds up under more exploration.