manzt / numcodecs.js

Buffer compression and transformation codecs
MIT License

VLenUtf8 Support #28

Open ilan-gold opened 3 years ago

ilan-gold commented 3 years ago

VlenUtf8 is a common codec for string arrays, and porting it should be relatively straightforward: https://github.com/zarr-developers/numcodecs/blob/2c1aff98e965c3c4747d9881d8b8d4aad91adb3a/numcodecs/vlen.pyx#L48-L178
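
For orientation, the byte layout used by the linked Python implementation is a 4-byte little-endian item count, followed by each string as a 4-byte little-endian byte length plus its UTF-8 bytes. A minimal TypeScript sketch of that layout might look like the following; the function names `encodeVLenUtf8` and `decodeVLenUtf8` are hypothetical and not part of numcodecs.js.

```typescript
// Sketch only: function names are hypothetical, not an existing numcodecs.js API.
// Layout: [uint32 LE item count] then, per item, [uint32 LE byte length][UTF-8 bytes].
const HEADER_LENGTH = 4;

function encodeVLenUtf8(items: string[]): Uint8Array {
  const encoder = new TextEncoder();
  const encoded = items.map((s) => encoder.encode(s));
  // Each item costs 4 bytes for its length prefix plus its UTF-8 payload.
  const total = encoded.reduce((n, b) => n + 4 + b.length, HEADER_LENGTH);
  const out = new Uint8Array(total);
  const view = new DataView(out.buffer);
  view.setUint32(0, items.length, true);
  let pos = HEADER_LENGTH;
  for (const bytes of encoded) {
    view.setUint32(pos, bytes.length, true);
    pos += 4;
    out.set(bytes, pos);
    pos += bytes.length;
  }
  return out;
}

function decodeVLenUtf8(buf: Uint8Array): string[] {
  if (buf.byteLength < HEADER_LENGTH) {
    throw new Error("corrupt buffer: missing header");
  }
  const decoder = new TextDecoder();
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
  const count = view.getUint32(0, true);
  // The result is a plain Array, since strings have no TypedArray equivalent.
  const out: string[] = [];
  let pos = HEADER_LENGTH;
  for (let i = 0; i < count; i++) {
    const len = view.getUint32(pos, true);
    pos += 4;
    out.push(decoder.decode(buf.subarray(pos, pos + len)));
    pos += len;
  }
  return out;
}
```

Round-tripping a small array, e.g. `decodeVLenUtf8(encodeVLenUtf8(["a", "héllo", ""]))`, should give back the same strings.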

I'm working on doing this for Vitessce, so if you're interested let me know!

ilan-gold commented 3 years ago

Hmmm, it seems that this is not a codec but a "filter." Does this belong in zarr.js then?

ilan-gold commented 3 years ago

Seems to work well: https://github.com/vitessce/vitessce/pull/948/files

ilan-gold commented 3 years ago

Can contribute if you're interested but not sure how you want to set up filters here/zarr.js

manzt commented 3 years ago

I think it makes sense to add filters to numcodecs.js (that's where they live for zarr-python, and they implement the codecs interface). However, zarr.js currently doesn't support using filters. That alone should be straightforward to add (essentially decode a chunk and then run the decoded chunk through a filter codec); the real issue here is the "dtype" itself.

Zarr.js only supports (numeric) dtypes that have an analogous TypedArray. There are no variably sized TypedArrays in JavaScript, so the decoded data would need to live in a plain JavaScript Array. Zarr.js relies on TypedArray APIs in both RawArray and NestedArray, so it would be tricky to add a dtype that currently isn't supported.
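
To make the decode path concrete, here is a rough sketch of "decompress, then run the chunk through filters". Everything in it is hypothetical: the `Codec` interface, `decodeChunk`, and the two toy codecs are stand-ins for illustration, not the zarr.js or numcodecs.js API. It assumes filters are undone in reverse order on decode, as zarr-python does.

```typescript
// Hypothetical types for illustration only; not the zarr.js or numcodecs.js API.
type ChunkData = Uint8Array | string[];

interface Codec {
  decode(data: ChunkData): ChunkData;
}

// Toy stand-in for a compressor: XORs every byte with 0xff (its own inverse).
const toyCompressor: Codec = {
  decode(data) {
    if (!(data instanceof Uint8Array)) throw new TypeError("expected bytes");
    return data.map((b) => b ^ 0xff);
  },
};

// Toy stand-in for a string filter: NUL-separated UTF-8 strings.
const toyStringFilter: Codec = {
  decode(data) {
    if (!(data instanceof Uint8Array)) throw new TypeError("expected bytes");
    return new TextDecoder().decode(data).split("\0");
  },
};

function decodeChunk(
  raw: Uint8Array,
  compressor: Codec | null,
  filters: Codec[],
): ChunkData {
  let chunk: ChunkData = compressor ? compressor.decode(raw) : raw;
  // Filters run in order on encode, so they are undone in reverse on decode.
  for (const filter of [...filters].reverse()) {
    chunk = filter.decode(chunk);
  }
  return chunk;
}
```

The return type is where the dtype problem shows up: once a string filter is involved, the chunk is a `string[]` rather than a TypedArray, so anything downstream that assumes TypedArray methods would need a separate code path.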

ilan-gold commented 3 years ago

@manzt I'm not as familiar with this so I defer to you here. Would it make sense to create a new typed array like StringArray? Or some sort of catch-all for non-recognized types? This is definitely outside my wheelhouse, so if you want to come up with a roadmap here, I can help fill in with PRs etc.

SiggyF commented 2 years ago

Thank you for opening this issue. I'm looking into implementing this support for zarr.js. I had a call on this topic with the developer @gzuidhof.

I looked at some details on how this is done in the Python package numcodecs with vlenutf8 support and noticed the following things:

Other things to note are:

Assuming the implementation approach should follow the Python numcodecs implementation, I would suggest the following roadmap:

We're glad to contribute in any of the subtasks.

h-mayorquin commented 5 months ago

Following up on this. Is this something the developers are still interested in? I might be able to contribute to this one.