sgkit-dev / bio2zarr

Convert bioinformatics file formats to Zarr
Apache License 2.0
26 stars 7 forks source link

Run tests against Zarr 3 #254

Open tomwhite opened 3 months ago

tomwhite commented 3 months ago

There's an alpha release available now that can be installed using pip install --pre

https://pypi.org/project/zarr/3.0.0a0/

jeromekelleher commented 3 months ago

I had a quick look at this and it seems like quite a bit of stuff is broken. I hit a wall at finding a way to specify a name for a new array in a Group. This seems pretty basic and I don't have time to chase down the details, so I'll try again when the next alpha comes out.

tomwhite commented 3 months ago

I'll try again when the next alpha comes out

There is no next alpha according to https://github.com/zarr-developers/zarr-python/issues/1777.

jeromekelleher commented 3 months ago

I just had another go with the tip of the v3 branch, and still hitting walls with array creation. The support for creating v2 arrays seems to be pretty thin, and it's not clear at all to me how we're supposed to go about it. I don't really know where to start tbh.

jeromekelleher commented 3 months ago

For reference:

python3 -m pip install pip install git+https://zarr-developers/zarr-python
tomwhite commented 3 months ago

I'll take a look

tomwhite commented 3 months ago

Here's the branch where I've been experimenting with Zarr v3: https://github.com/tomwhite/bio2zarr/tree/zarr-v3

After making changes to adapt to the different v3 API, it's now failing because Zarr v3 doesn't support string dtypes:

``` _____________________________________________________________ TestVcfZarrWriterExample.test_encode_partition[0] _____________________________________________________________ self = icf_path = PosixPath('/private/var/folders/9j/h1v35g4166z6zt816fq7wymc0000gn/T/pytest-of-tom/pytest-1771/data11/example.exploded') tmp_path = PosixPath('/private/var/folders/9j/h1v35g4166z6zt816fq7wymc0000gn/T/pytest-of-tom/pytest-1771/test_encode_partition_0_0'), partition = 0 @pytest.mark.parametrize("partition", [0, 1, 2]) def test_encode_partition(self, icf_path, tmp_path, partition): zarr_path = tmp_path / "x.zarr" vcf2zarr.encode_init(icf_path, zarr_path, 3, variants_chunk_size=3) partition_path = zarr_path / "wip" / "partitions" / f"p{partition}" assert not partition_path.exists() > vcf2zarr.encode_partition(zarr_path, partition) tests/test_vcz.py:508: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ bio2zarr/vcf2zarr/vcz.py:1064: in encode_partition writer.encode_partition(partition) bio2zarr/vcf2zarr/vcz.py:683: in encode_partition self.encode_id_partition(partition_index) bio2zarr/vcf2zarr/vcz.py:801: in encode_id_partition vid.flush() bio2zarr/core.py:145: in flush sync_flush_1d_array( bio2zarr/core.py:163: in sync_flush_1d_array zarr_array[offset : offset + np_buffer.shape[0]] = np_buffer ../zarr-python/src/zarr/array.py:984: in __setitem__ self.set_basic_selection(cast(BasicSelection, pure_selection), value, fields=fields) ../zarr-python/src/zarr/array.py:1198: in set_basic_selection sync(self._async_array._set_selection(indexer, value, fields=fields, prototype=prototype)) ../zarr-python/src/zarr/sync.py:92: in sync raise return_result ../zarr-python/src/zarr/sync.py:51: in _runner return await coro ../zarr-python/src/zarr/array.py:518: in _set_selection value_buffer = prototype.nd_buffer.from_ndarray_like(value) ../zarr-python/src/zarr/buffer.py:339: in from_ndarray_like return cls(ndarray_like) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = <[AttributeError("'NDBuffer' object has no attribute '_data'") raised in repr()] NDBuffer object at 0x7f9ddd467250> array = array(['.', '.', 'rs6054257'], dtype=object) def __init__(self, array: NDArrayLike): # assert array.ndim > 0 > assert array.dtype != object E AssertionError ../zarr-python/src/zarr/buffer.py:286: AssertionError ```

This is a major limitation for us. The Zarr v3 core spec does not cover strings.

They are likely to be a future extension:

The set of data types specified in v3 is less than in v2. Additional data types will be defined via extensions.

jeromekelleher commented 3 months ago

Thanks Tom!

Yeesh, the lack of strings is scary. Looks like we'll be on v2 for a long time then.

d-v-b commented 2 months ago

fwiw I would love to chart a path to getting more dtypes into zarr v3 (since i happen to be working on the v3 fill value normalization right now). as you noted, there's a dtype extension mechanism built into the spec but we haven't exercised it yet. Could you share or link to a description of how you are using string arrays, either here or in a discussion over in zarr specs? That might help kick-start things.

tomwhite commented 2 months ago

Hi @d-v-b, thanks for commenting! We'd actually like to encode variable-length strings, which use an object dtype and a numcodecs.vlen.VLenUTF8 codec.

I've added a comment explaining our use case to the discussion at https://github.com/zarr-developers/zarr-specs/issues/83

tomwhite commented 2 months ago

I filed the following issues to improve Zarr Python v3 API compatibility here:

tomwhite commented 2 months ago

As an experiment I tried creating a VLenUTF8Codec which uses numcodecs.vlen.VLenUTF8. I can successfully run a test that writes a VCF Zarr file and then validates it:

pytest 'tests/test_vcf_examples.py::test_by_validating[sample.vcf.gz]'
jeromekelleher commented 2 months ago

Amazing! :tada:

tomwhite commented 2 months ago

I updated https://github.com/tomwhite/bio2zarr/tree/zarr-v3 to use the code from https://github.com/zarr-developers/zarr-python/pull/2036 and the test still passes.