Open tomwhite opened 3 months ago
I had a quick look at this and it seems like quite a bit of stuff is broken. I hit a wall trying to find a way to specify a name for a new array in a Group. This seems pretty basic and I don't have time to chase down the details, so I'll try again when the next alpha comes out.
> I'll try again when the next alpha comes out
There is no next alpha according to https://github.com/zarr-developers/zarr-python/issues/1777.
I just had another go with the tip of the v3 branch, and I'm still hitting walls with array creation. The support for creating v2 arrays seems to be pretty thin, and it's not at all clear to me how we're supposed to go about it. I don't really know where to start tbh.
For reference:
python3 -m pip install git+https://github.com/zarr-developers/zarr-python@v3
I'll take a look
Here's the branch where I've been experimenting with Zarr v3: https://github.com/tomwhite/bio2zarr/tree/zarr-v3
After making changes to adapt to the different v3 API, it's now failing because Zarr v3 doesn't support string dtypes:
This is a major limitation for us. The Zarr v3 core spec does not cover strings.
They are likely to be a future extension:
> The set of data types specified in v3 is less than in v2. Additional data types will be defined via extensions.
Thanks Tom!
Yeesh, the lack of strings is scary. Looks like we'll be on v2 for a long time then.
fwiw I would love to chart a path to getting more dtypes into zarr v3 (since i happen to be working on the v3 fill value normalization right now). as you noted, there's a dtype extension mechanism built into the spec but we haven't exercised it yet. Could you share or link to a description of how you are using string arrays, either here or in a discussion over in zarr specs? That might help kick-start things.
Hi @d-v-b, thanks for commenting! We'd actually like to encode variable-length strings, which use an object dtype and a numcodecs.vlen.VLenUTF8 codec.
I've added a comment explaining our use case to the discussion at https://github.com/zarr-developers/zarr-specs/issues/83
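To make the variable-length string case concrete, here is a minimal pure-Python sketch of a length-prefixed UTF-8 layout of the kind such a codec needs, since the strings in an object array have no fixed item size. The function names `encode_vlen_utf8` and `decode_vlen_utf8` are hypothetical, and the byte layout (a little-endian 4-byte item count followed by length-prefixed items) is an assumption for illustration; it is not claimed to be byte-for-byte what numcodecs.vlen.VLenUTF8 produces.

```python
import struct


def encode_vlen_utf8(strings):
    """Pack a list of str into one bytes buffer (illustrative layout only)."""
    parts = [struct.pack("<I", len(strings))]  # header: number of items
    for s in strings:
        data = s.encode("utf-8")
        parts.append(struct.pack("<I", len(data)))  # per-item byte length
        parts.append(data)
    return b"".join(parts)


def decode_vlen_utf8(buf):
    """Inverse of encode_vlen_utf8: recover the list of str from the buffer."""
    (n_items,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    out = []
    for _ in range(n_items):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        out.append(buf[offset:offset + length].decode("utf-8"))
        offset += length
    return out
```

Round-tripping a list such as `["A", "", "ACGT"]` through both functions returns the original list. The lengths are counted in bytes rather than characters, which is why non-ASCII strings survive the round trip.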
I filed the following issues to improve Zarr Python v3 API compatibility here:
As an experiment I tried creating a VLenUTF8Codec which uses numcodecs.vlen.VLenUTF8. I can successfully run a test that writes a VCF Zarr file and then validates it:
pytest 'tests/test_vcf_examples.py::test_by_validating[sample.vcf.gz]'
Amazing! :tada:
I updated https://github.com/tomwhite/bio2zarr/tree/zarr-v3 to use the code from https://github.com/zarr-developers/zarr-python/pull/2036 and the test still passes.
There's an alpha release available now that can be installed using
pip install --pre zarr
https://pypi.org/project/zarr/3.0.0a0/
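For code that has to run against both the v2 and v3 APIs during a migration like this one, one option is to branch on the installed major version. The following is a minimal sketch using only the standard library. The helper names `major_version` and `installed_zarr_major` are hypothetical, and it assumes the distribution is named zarr, as on the PyPI page above.

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading integer of a version string such as '3.0.0a0'."""
    return int(version_string.split(".", 1)[0])


def installed_zarr_major():
    """Return the installed zarr major version, or None if zarr is absent."""
    try:
        return major_version(metadata.version("zarr"))
    except metadata.PackageNotFoundError:
        return None
```

A caller could then pick a code path with something like `if installed_zarr_major() == 3:`. This only looks at the version metadata, so it does not import zarr itself.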