Closed maxime915 closed 2 years ago
I did not take the time to think carefully about this pull request: I did not consider whether this new type of array would work with the memory limitation feature of this package. This should probably have been an issue and not a pull request. Sorry for the inconvenience.
Goal
Ragged arrays can hold data that would otherwise not fit in a traditional array. In mass spectrometry, for example, some scenarios lead to planar images with a variable number of sampling points across the (x, y) coordinates.
I am currently developing a way to convert mass spectrometry images to Zarr arrays using a 2-stage conversion: the image is first converted to a ragged array with chunks of (1, 1) and no compressor, and should then be converted to an equivalent array with other chunk/compressor parameters. The rechunker package would be a perfect fit for this task, but converting ragged arrays is currently not optimal.
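To make the idea of the second stage concrete, here is a minimal pure-Python sketch of regrouping per-pixel data into larger chunks. It does not use Zarr or rechunker at all: the `regroup` function, the `cells` dictionary, and the tiny 2 x 3 image are hypothetical and only model the concept of going from (1, 1) chunks to a larger chunk shape while each pixel keeps its own variable-length spectrum.

```python
from collections import defaultdict


def regroup(cells, chunk_shape):
    """Group per-pixel spectra (one (1, 1) chunk each) into larger chunks.

    `cells` maps (y, x) pixel coordinates to a variable-length list of samples.
    The result maps a chunk index to the pixels that fall inside that chunk.
    """
    chunk_h, chunk_w = chunk_shape
    chunks = defaultdict(dict)
    for (y, x), spectrum in cells.items():
        # Integer division gives the index of the target chunk for this pixel.
        chunks[(y // chunk_h, x // chunk_w)][(y, x)] = spectrum
    return dict(chunks)


# A tiny 2 x 3 ragged "image": each pixel holds a different number of samples.
cells = {
    (0, 0): [1.0],
    (0, 1): [2.0, 2.5],
    (0, 2): [],
    (1, 0): [3.0, 3.1, 3.2],
    (1, 1): [4.0],
    (1, 2): [5.0, 5.5],
}

# Regroup into chunks covering 2 rows and 2 columns each.
chunks = regroup(cells, chunk_shape=(2, 2))
```

The spectra themselves are left untouched; only their grouping changes, which is what a rechunking step would be expected to do for a ragged array.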
Problem to solve
Consider this Zarr array:
That should be re-chunked to produce the following (or any other arbitrary chunk shape and compressor):
The easy solution would be to use rechunker with target_chunks=(25, 40) and target_options={'compressor': 'default'}. However, running this code raises an exception, ValueError: missing object_codec for object array (".../lib/python3.8/site-packages/zarr/storage.py", line 429, in _init_array_metadata).
The immediate solution would be to add the required option, giving target_options={'compressor': 'default', 'object_codec': numcodecs.VLenArray(parser.intensityPrecision)}. However, running this code raises an exception, ValueError: Zarr options must not include object_codec (got object_codec=VLenArray(dtype='<f4')) [...] (in _validate_options).
Proposed solution
To rechunk a ragged array, it is necessary to pass the object_codec parameter with a value of type numcodecs.vlen.VLenArray. This would allow re-chunking of this type of array.
The proposed changes allow using the object_codec parameter, but only with a value of type numcodecs.vlen.VLenArray. The goal of this restriction is to reduce the chance of any unintentional alteration of the library.
Testing
I checked on a few Zarr ragged arrays and did not encounter any issues.
The output of pytest for this package shows 139 passed, 41 skipped, 18 xfailed, 48 warnings in 24.93s without any modification and 139 passed, 41 skipped, 18 xfailed, 48 warnings in 24.91s with the modifications from this pull request on my machine (Ubuntu 20.04).
I did not add any test case to the package.
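As a rough idea of what a test for the proposed restriction might check, here is a minimal sketch. It is not rechunker's actual code: the `validate_options` function, the `ALLOWED_OPTIONS` set, and the `is_accepted` helper are hypothetical, and the `VLenArray` class is only a stand-in so that the example runs without numcodecs (a real test would presumably use numcodecs.vlen.VLenArray instead).

```python
class VLenArray:
    """Hypothetical stand-in for numcodecs.vlen.VLenArray (illustration only)."""

    def __init__(self, dtype):
        self.dtype = dtype

    def __repr__(self):
        return f"VLenArray(dtype={self.dtype!r})"


# Assumed allowlist for this sketch; not rechunker's real list of options.
ALLOWED_OPTIONS = {"compressor", "filters", "fill_value", "order"}


def validate_options(options):
    """Reject unknown options, but accept object_codec if it is a VLenArray."""
    for key, value in options.items():
        if key == "object_codec":
            # The restriction described above: only VLenArray codecs pass.
            if not isinstance(value, VLenArray):
                raise ValueError(
                    f"object_codec must be a VLenArray (got {key}={value!r})"
                )
            continue
        if key not in ALLOWED_OPTIONS:
            raise ValueError(
                f"Zarr options must not include {key} (got {key}={value!r})"
            )


def is_accepted(options):
    """Return True if validate_options raises no error for these options."""
    try:
        validate_options(options)
    except ValueError:
        return False
    return True


# A VLenArray codec is accepted, any other object_codec is rejected.
ragged_ok = is_accepted({"compressor": "default", "object_codec": VLenArray("<f4")})
other_codec_ok = is_accepted({"compressor": "default", "object_codec": object()})
```

Keeping the check on the codec type explicit means that options which were rejected before the change are still rejected afterwards.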
EDIT: numcodecs is a dependency of zarr, so I did not add it to the dependencies of rechunker. I hope this was the right thing to do.