[Feature]: Write zarr without using pickle

magland commented 7 months ago

What would you like to see added to HDMF-ZARR?

I was working on adding nwb-zarr support to neurosift and I ran into a problem where many of the datasets within the zarr were pickle-encoded. It is difficult to read pickled data in languages other than Python, and it is practically impossible using JavaScript in the browser. So I would like to request that hdmf-zarr be updated to avoid using pickling when writing datasets.

Is your feature request related to a problem?

No response

What solution would you like?

@bendichter and I created a script for doing the h5->zarr conversion without using pickle: https://gist.github.com/magland/39c5c5c5afabb982b5dd5d2367f2179b

and we confirmed that it (a) the output is loadable in hdmf-zarr and (b) it works with Neurosift

We have only tried it on the one nwb file so far.

Perhaps this script could be helpful in making the modifications.

I perused the hdmf-zarr source code, but it wasn't clear to me where the changes needed to take place.

Tagging: @rly

Do you have any interest in helping implement the feature?

Yes.

Code of Conduct

[X] I agree to follow this project's Code of Conduct
[X] Have you checked the Contributing document?
[X] Have you ensured this change was not already requested?

oruebel commented 7 months ago

Thanks for the suggestion. I think this means updating the calls to require_dataset in backend.py to set the object_codec. I believe currently the object_codec is not being specified, so it should use whatever the default in Zarr is. The question is, which encoder is the best one to use, i.e., is most portable and most performant in terms of en/decoding speed and size. Do you have any thoughts on the encoder to use?

oruebel commented 7 months ago

@magland is this something you are interested in helping with by contributing a PR or would you like us to take a look at this?

magland commented 7 months ago

@oruebel I am happy to create a PR, but I will need some help knowing what to change.

Now that you pointed to require_dataset I am seeing lines like this:

dset = parent.require_dataset(name,
                                              shape=shape,
                                              dtype=object,
                                              object_codec=self.__codec_cls(),
                                              **options['io_settings'])

and

self.__codec_cls = numcodecs.pickles.Pickle if object_codec_class is None else object_codec_class

That should give me something to go on.

I'll working on putting something together for you to look at.

oruebel commented 7 months ago

I'll working on putting something together for you to look at.

Great! Thanks for the help!

A first test could be to just set the object_codec_class parameter when creating the ZarrIO object via ZarrIO.__init__. If that produces files that work for you, then this may be as simple as changing the default for the object codec here:

https://github.com/hdmf-dev/hdmf-zarr/blob/556ed123e5cc07a434fd13593c06dca861267b8f/src/hdmf_zarr/backend.py#L124

I think there may be a couple of places where the object_codec is not being set. I'm not sure right now if we may need to add the object codec as an additional argument to those calls, or whether those calls don't create datasets that require object encoding. However, I think object codec is being ignored by Zarr for all other data (i.e., if we don't store objects), so I think it may be safe to just add it to those calls, from a quick look, I think these are the two calls don't specify the object codec:

https://github.com/hdmf-dev/hdmf-zarr/blob/556ed123e5cc07a434fd13593c06dca861267b8f/src/hdmf_zarr/backend.py#L1272

https://github.com/hdmf-dev/hdmf-zarr/blob/556ed123e5cc07a434fd13593c06dca861267b8f/src/hdmf_zarr/backend.py#L1312

In addition to datasets, another place where encoding could potentially come up is when writing attributes. I don't think this should be an issue, because a) Zarr stores attributes in JSON, and b) I don't think NWB stores objects in attributes. However, I wanted to mention it just in case. If attribute encoding should come up as an issue, those are written here:

https://github.com/hdmf-dev/hdmf-zarr/blob/556ed123e5cc07a434fd13593c06dca861267b8f/src/hdmf_zarr/backend.py#L558

magland commented 7 months ago

Thanks @oruebel

I tried out changing the default codec class in ZarrIO to numcodecs.JSON as you suggested, and indeed that worked! At least for the dataset I tested.

I'll make a PR and we'll see if it passes the tests.

hdmf-dev / hdmf-zarr