pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

[Proposal] allow object_codec parameter of type VLenArray #102

Closed maxime915 closed 2 years ago

maxime915 commented 2 years ago

Goal

Ragged arrays can hold data that would otherwise not fit in a traditional array. In mass spectrometry, for example, some scenarios lead to planar images with a variable number of sampling points across the (x, y) coordinates.

I am currently developing a way to convert mass spectrometry images to Zarr arrays using a 2-stage conversion: the image is first converted to a ragged array with chunks of (1, 1) and no compressor, and should then be converted to an equivalent array with other chunk/compressor parameters. The rechunker package would be a perfect fit for this task, but converting ragged arrays is currently not optimal.

Problem to solve

Consider this Zarr array:

Type               : zarr.core.Array
Data type          : object
Shape              : (142, 141)
Chunk shape        : (1, 1)
...
Filter [0]         : VLenArray(dtype='<f4')
Compressor         : None
Store type         : zarr.storage.DirectoryStore
...

That should be re-chunked to produce the following (or any other arbitrary chunk shape and compressor):

Type               : zarr.core.Array
Data type          : object
Shape              : (142, 141)
Chunk shape        : (25, 40)
...
Filter [0]         : VLenArray(dtype='<f4')
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : zarr.storage.DirectoryStore
...

The easy solution would be to use rechunker with target_chunks=(25, 40) and target_options={'compressor': 'default'}. However, running this code raises ValueError: missing object_codec for object array (".../lib/python3.8/site-packages/zarr/storage.py", line 429, in _init_array_metadata).

The obvious fix would be to add the required option, giving target_options={'compressor': 'default', 'object_codec': numcodecs.VLenArray(parser.intensityPrecision)}. However, running this code raises ValueError: Zarr options must not include object_codec (got object_codec=VLenArray(dtype='<f4')) [...] (in _validate_options).

Proposed solution

To rechunk a ragged array, it is necessary to pass the object_codec parameter with a value of type numcodecs.vlen.VLenArray. Accepting this parameter would allow re-chunking of this type of array.

The proposed changes allow using the object_codec parameter, but only with a value of type numcodecs.vlen.VLenArray. The goal of this restriction is to reduce the chance of any unintentional alteration of the library's behavior.

Testing

I checked on a few Zarr ragged arrays and didn't run into any issues.

On my machine (Ubuntu 20.04), pytest for this package reports 139 passed, 41 skipped, 18 xfailed, 48 warnings in 24.93s without any modification, and 139 passed, 41 skipped, 18 xfailed, 48 warnings in 24.91s with the modifications from this pull request.

I did not add any test case to the package.

EDIT: numcodecs is a dependency of zarr, so I did not add it to the dependencies of rechunker. I hope this was the right thing to do.

maxime915 commented 2 years ago

I did not take the time to think carefully about this pull request: I did not consider whether this new type of array would work with the memory limitation feature of this package. This should probably have been an issue and not a pull request. Sorry for the inconvenience.