magland closed this pull request 3 weeks ago.
In this PR I adjusted the default object_codec_class to numcodecs.JSON.
I think the main reason I didn't use JSON at the time was because of cases where objects are not JSON serializable. However, I don't remember whether that was an actual issue I encountered or whether I was just being overly cautious and used Pickle because I knew it would work. But I think if the test suite passes we should be Ok.
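For context on the trade-off, the practical difference between the two codecs can be sketched with the stdlib `json` and `pickle` modules (a minimal illustration, not hdmf-zarr code): Pickle yields an opaque Python-specific byte stream, while JSON yields text that non-Python readers (e.g., JavaScript in neurosift) can parse directly.

```python
import json
import pickle

# Hypothetical object-typed values as they might appear in a Zarr object
# array (the values here are illustrative, not hdmf-zarr internals).
values = ["a string", {"source": ".", "path": "/acquisition/ts"}]

# Pickle produces an opaque, Python-only byte stream...
pickled = pickle.dumps(values)
assert isinstance(pickled, bytes)

# ...while JSON produces plain text that any language can parse.
encoded = json.dumps(values)
assert json.loads(encoded) == values
```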
Could you also please add a line to the CHANGELOG https://github.com/hdmf-dev/hdmf-zarr/blob/dev/CHANGELOG.md
Do you want me to add it under the 0.6.0 heading, or make a new heading? (0.6.1, 0.7.0)
Please add it under a new heading "0.7.0 (upcoming)" since 0.6.0 has already been released
Done.
Now that I made the actual modification, some tests are failing on my fork. So I guess the process is going to be more involved.
It looks like there are two main types of errors related to the JSON encoding:

1. `TypeError: Object of type void is not JSON serializable` --> This appears in 5 failing tests, which all seem to be related to writing `ZarrWriteUnit`, i.e., I think these are likely all the same issue.
2. `AttributeError: 'int' object has no attribute 'append'` --> This appears in 18 failing tests, and as far as I can tell they are all related to writing compound datasets with references. More specifically, I believe these are writing a compound data type that is a mix of strings and ints, similar to `TimeSeriesReferenceVectorData` in NWB. The point of failure for this error appears in all cases to be in the `write_dataset` function at:
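The first error class can be reproduced in miniature with the stdlib `json` module; rows of a NumPy structured array (`np.void` scalars) fail in the same way this stand-in does (an illustration of the failure mode, not the actual failing test):

```python
import json

# json.dumps raises TypeError for any type its default encoder does not
# handle; np.void (a row of a NumPy structured array) hits the same code
# path as this bytes stand-in.
try:
    json.dumps(b"\x00\x01")
except TypeError as err:
    print(err)  # Object of type bytes is not JSON serializable
```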
https://github.com/hdmf-dev/hdmf-zarr/pull/146 updated the writing of compound datasets to fix the determination of the dtype and may be relevant here.
Thanks, I'm working on a potential solution.
@oruebel I have pushed a solution where it resorts to a Pickle codec in certain (hopefully rare) cases, and prints a warning.
The cases are: (a) structured array and (b) compound dtype with embedded references
For the purposes of neurosift, I am okay with this, because it seems to be a rare occurrence. But I wasn't sure whether you wanted to either suppress the warnings or try to serialize this in a different way.
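The fallback described above can be sketched as follows; `choose_codec_name`, `is_structured`, and `has_references` are illustrative names, not hdmf-zarr internals:

```python
import warnings

# Hypothetical sketch of the fallback logic: prefer JSON, but fall back
# to Pickle for (a) structured arrays and (b) compound dtypes with
# embedded references, printing a warning when that happens.
def choose_codec_name(is_structured: bool, has_references: bool) -> str:
    if is_structured or has_references:
        warnings.warn(
            "Falling back to the Pickle codec; the resulting file "
            "will not be readable outside Python."
        )
        return "pickle"
    return "json"

print(choose_codec_name(False, False))  # json
```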
Thanks a lot! I think it would be great to see if we can use JSON for those cases too. The TimeSeriesReferenceVectorData is probably the main case in NWB. This is a compound of two ints and a reference. References are stored as a JSON string, so I think those should be serializable with JSON too.
@mavaylon1 could you take a look at this PR to see if we can use JSON serialization for the compound data case that @magland mentioned as well. It would be good to be consistent in how we serialize objects. I think this is related to the recent PR on compound data types.
I found this particularly challenging example in the wild. It seems it has a dataset that is a list of lists with embedded references.
000713 - Allen Institute - Visual Behavior - Neuropixels
https://api.dandiarchive.org/api/assets/b2391922-c9a6-43f9-8b92-043be4015e56/download/
intervals/grating_presentations/timeseries
[[3, 135, <HDF5 object reference>], [138, 348, <HDF5 object reference>], [486, 392, <HDF5 object reference>], [878, 474, <HDF5 object reference>], [1352, 373, <HDF5 object reference>], ..., [53330, 430, <HDF5 object reference>], [53760, 210, <HDF5 object reference>]] (excerpt; the full dataset contains over a hundred such rows)
Yes, that is a TimeSeriesReferenceVectorData (https://nwb-schema.readthedocs.io/en/stable/format.html#timeseriesreferencevectordata), which is a compound dataset that has a reference to a TimeSeries and then two ints with the start and stop index.
Note, in Zarr the object reference is represented as a JSON string, so in Zarr the dtype for TimeSeriesReferenceVectorData ends up being (int, int, str). hdmf-zarr then deals with translating the JSON string part back to resolve the reference; this is done lazily on read. I.e., from a Zarr perspective, I think this datatype of (int, int, str) should in principle still be serializable in JSON.
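The point above can be sketched with the stdlib `json` module: since the object reference is itself already a JSON string in the Zarr store, each (int, int, str) row is JSON-safe end to end (the reference payload below is illustrative, not the exact hdmf-zarr representation):

```python
import json

# An object reference stored as a JSON string (illustrative payload).
reference = json.dumps({"source": ".", "path": "/acquisition/ts"})

# One compound row of a TimeSeriesReferenceVectorData-like dataset:
# (idx_start, count, reference-as-JSON-string).
row = [3, 135, reference]

# The whole row round-trips through JSON without a Pickle fallback.
encoded = json.dumps(row)
idx_start, count, ref = json.loads(encoded)
assert (idx_start, count) == (3, 135)
assert json.loads(ref)["path"] == "/acquisition/ts"
```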
Attention: Patch coverage is 89.47368%, with 2 lines in your changes missing coverage. Please review. Project coverage is 86.07%. Comparing base (556ed12) to head (29adcca).
Files | Patch % | Lines |
---|---|---|
src/hdmf_zarr/backend.py | 88.88% | 0 Missing and 2 partials :warning: |
Can we turn the codec class into a getter method so we can encapsulate all switching logic in one place?
currently there's: https://github.com/hdmf-dev/hdmf-zarr/blob/29adcca1af14bb03255fdf52723daedf6c67ce44/src/hdmf_zarr/backend.py#L154
the logic of when to use which codec is split in a few places, but it seems to ultimately depend on `builder.dtype`:
this check here: https://github.com/hdmf-dev/hdmf-zarr/blob/29adcca1af14bb03255fdf52723daedf6c67ce44/src/hdmf_zarr/backend.py#L1054 has its roots here: https://github.com/hdmf-dev/hdmf-zarr/blob/29adcca1af14bb03255fdf52723daedf6c67ce44/src/hdmf_zarr/backend.py#L955
and similarly this check: https://github.com/hdmf-dev/hdmf-zarr/blob/29adcca1af14bb03255fdf52723daedf6c67ce44/src/hdmf_zarr/backend.py#L1279 has its roots here: https://github.com/hdmf-dev/hdmf-zarr/blob/29adcca1af14bb03255fdf52723daedf6c67ce44/src/hdmf_zarr/backend.py#L1218 called from here: https://github.com/hdmf-dev/hdmf-zarr/blob/29adcca1af14bb03255fdf52723daedf6c67ce44/src/hdmf_zarr/backend.py#L1071
which is just the other leg of the above check, and L1276-1291 is very similar to L1037-1050.
`write_builder` and the other `write_*` methods are almost private methods, since the docs only demonstrate using `write()`, but the builder is passed through them rather than stored as an instance attribute (presumably since it gets remade on each `write()`? not really sure), so we can't assume we have it and use the above `object_codec_class` property. But it looks like every time we get `__codec_cls` we also have the builder, or the `options` dict that gets constructed from it, in scope, so couldn't we do something like `get_codec_cls(dtype)`?
not my package, but when I was working on this before I had left this note: https://github.com/hdmf-dev/hdmf-zarr/blob/29adcca1af14bb03255fdf52723daedf6c67ce44/src/hdmf_zarr/backend.py#L1033-L1036, which seems to be out of date because we're using objects rather than converting to strings (presumably the root of the issue here). So it also seems like we can consolidate the several places where we create and check a dtype into that `__resolve_dtype_helper__` method, which could then make `get_codec_cls()` well defined as well.
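A self-contained sketch of the proposed `get_codec_cls(dtype)` getter; the real implementation would return numcodecs codec classes and inspect the builder dtype, so the stdlib `json`/`pickle` modules and the dtype-name strings below are stand-ins:

```python
import json
import pickle

# Hypothetical single place that maps a dtype to a codec, so the
# switching logic is no longer split across write_dataset and friends.
def get_codec_cls(dtype):
    # Compound dtypes whose members are all JSON-friendly can use the
    # JSON codec; anything else falls back to a Pickle-style codec.
    json_friendly = {"int", "float", "str", "bool"}
    members = set(dtype) if isinstance(dtype, (list, tuple)) else {dtype}
    return json if members <= json_friendly else pickle

print(get_codec_cls(("int", "int", "str")).__name__)  # json
print(get_codec_cls("object").__name__)  # pickle
```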
@sneakers-the-rat some more details below, but to make a long story short, my hope is that we can: 1. use a single codec (JSON) for all cases if possible, and 2. remove the object_codec_class parameter on the ZarrIO.__init__ method and not allow users to manually specify the codec to use (this is probably a good idea in general to avoid having to deal with too many corner cases). @magland @sneakers-the-rat sorry that I have not had the chance to dive into the code for this PR yet, but those are things I hope we can do before merging this PR, i.e., try to do 1. if possible and implement 2. if we can't make 1. work.
> the logic of when to use which codec is split in a few places
TBH, my hope is that we can avoid having to use different codecs if possible. The use of numcodecs.Pickle was convenient because it is pretty reliable in handling even complex cases; however, it makes it hard for non-Python code (e.g., JavaScript in the case of neurosift) to use the Zarr files directly. numcodecs.JSON makes that easier, but JSON is not as flexible as Pickle in terms of the cases it handles. The main issue seems to be structured arrays. However, in our case those are mostly a combination of ints and strings, so my hope is that we can handle those in JSON as well, and only fall back to Pickle when absolutely necessary.
> Can we turn the codec class into a getter method so we can encapsulate all switching logic in one place?
Currently the backend uses a single codec throughout, so a getter was not needed, but if we really need to use multiple codecs (which, again, I ideally would like to avoid if possible), then I totally agree that encapsulating this logic in a method is a good idea. We probably then also need to get rid of object_codec_class as a parameter on __init__, since we need to control when to use which codec: https://github.com/hdmf-dev/hdmf-zarr/blob/897d3b95ddfadd9364b2e2bc019e6ade86e920ce/src/hdmf_zarr/backend.py#L95-L97
> so couldn't we do something like get_codec_cls(dtype)?

Yes, I believe that should work. The object_codec_class property is mainly there since the codec is currently configurable on __init__.
@oruebel I'll dive in this week.
Recap and Updates: We ideally only want one codec. I've been trying to get the compound case for references to work, but I've run into some issues. Using the dev branch of numcodecs, the problem is decoding the data. I made an issue ticket on the numcodecs GitHub and will try to find a fix: https://github.com/zarr-developers/numcodecs/issues/518
I have an "almost" proof of concept for this. I will share this week if it holds up with more exploration.
Motivation
See #171
In this PR I adjusted the default object_codec_class to numcodecs.JSON.
In addition, I also exposed this parameter in the constructor of NWBZarrIO
Fix #171
How to test the behavior?
I ran the following with my forked version and verified that the resulting Zarr file opens properly in Neurosift:
Checklist
Note: I wasn't sure how to update CHANGELOG.md -- I mean, where to put the new text.
Run `ruff` from the source directory.