pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

Hardcoded `consolidate_reads` causes error for large array #75

Closed. lsetiawan closed this issue 3 years ago.

lsetiawan commented 3 years ago

Overview

When using a zarr group as the source rather than an xarray dataset, rechunking exceeds the max_buffer_size of the Blosc codec.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-122-55c76b7b8980> in <module>
      1 with ProgressBar():
----> 2     array_plan.execute()
...

numcodecs/blosc.pyx in numcodecs.blosc.Blosc.encode()

/opt/conda/envs/app-env/lib/python3.7/site-packages/numcodecs/compat.py in ensure_contiguous_ndarray(buf, max_buffer_size)
    100     if max_buffer_size is not None and arr.nbytes > max_buffer_size:
    101         msg = "Codec does not support buffers of > {} bytes".format(max_buffer_size)
--> 102         raise ValueError(msg)
    103 
    104     return arr

ValueError: Codec does not support buffers of > 2147483647 bytes

Investigation

Looking through the rechunker code, it appears that if the source array is a zarr.Array, it consolidates the whole array and uses that as the chunk to read. This is problematic because a zarr array can be very big! For example, one of my arrays is 18021459720 bytes unchunked, which is greater than the zarr.Blosc.max_buffer_size of 2147483647 bytes.

https://github.com/pangeo-data/rechunker/blob/15f7e31cf46296882d958566dfc62e91ce0d41b7/rechunker/api.py#L480-L481

Questions and comments

Is there a reason for this hardcoding? Is there a way to allow the reading of the original chunk from the source rather than the whole array?

Thank you in advance for your help!

rabernat commented 3 years ago

Welcome @lsetiawan and thanks for opening a thoughtful issue.

it consolidates the whole array and use that as the chunk to read.

consolidate_reads == True does not mean that the entire source array becomes a single chunk. It means that the source array is read using chunks that are larger than its native chunks, up to the size specified by max_mem. This is step 3 of the rechunker algorithm.

This block is where the read chunk size is determined:

https://github.com/pangeo-data/rechunker/blob/15f7e31cf46296882d958566dfc62e91ce0d41b7/rechunker/algorithm.py#L133-L150
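To make the idea concrete, here is a simplified illustration of read-chunk consolidation. This is a sketch of the concept only, not rechunker's actual implementation (which also plans intermediate and write chunks): grow the read chunks in whole multiples of the native chunks, one axis at a time, while staying under the max_mem budget.

import math

def consolidate_read_chunks(shape, source_chunks, itemsize, max_mem):
    """Grow read chunks in whole multiples of the native chunks, axis by axis,
    without letting a single read chunk exceed the max_mem budget (in bytes).
    A simplified sketch of the consolidation idea only, not rechunker's code."""
    read_chunks = list(source_chunks)
    for axis in range(len(shape)):
        while True:
            candidate = list(read_chunks)
            candidate[axis] = min(candidate[axis] + source_chunks[axis], shape[axis])
            if candidate == read_chunks:
                break  # this axis already spans the full dimension
            if itemsize * math.prod(candidate) > max_mem:
                break  # growing further would exceed the memory budget
            read_chunks = candidate
    return tuple(read_chunks)

# e.g. a (4000, 4000) float64 array with native (100, 100) chunks and a 64 MB budget
print(consolidate_read_chunks((4000, 4000), (100, 100), 8, 64 * 2**20))
# -> (4000, 2000): many native chunks are read per copy task, but the
#    underlying zarr chunks on disk are untouched.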

In my understanding, it is not possible for rechunker's consolidation of chunks to trigger your Blosc error, since it cannot actually alter the underlying chunks of the zarr array. Consolidating read chunks just means reading many zarr chunks within a single copy task. Instead, I think that your error is coming from deep inside zarr, when attempting to decode a single chunk.

I would start by verifying that you can simply read all the chunks of your input array, leaving rechunker out of the picture.
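For example, a minimal sketch of that sanity check might look like the following (the store path and array name are hypothetical):

import itertools
import zarr

source = zarr.open("source.zarr", mode="r")["my_array"]  # hypothetical path / array name

# Walk the native chunk grid and read each chunk individually; this will raise
# if any single chunk cannot be decoded, with rechunker out of the picture.
for block_index in itertools.product(
    *(range(-(-s // c)) for s, c in zip(source.shape, source.chunks))
):
    selection = tuple(
        slice(i * c, min((i + 1) * c, s))
        for i, c, s in zip(block_index, source.chunks, source.shape)
    )
    _ = source[selection]

print("all chunks read successfully")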

If that does not identify your problem, we will have to dig deeper into rechunker, ideally by developing a minimal complete verifiable example.

lsetiawan commented 3 years ago

Hi @rabernat,

Thank you for the explanation. That makes sense; it seems like max_mem was actually my issue. I had allowed workers to use 16GB of memory, but it seems this clashes with the zarr Blosc codec's max_buffer_size of ~2GB, so changing my max_mem to 2GB seems to have solved the problem.
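For anyone hitting the same thing, this is roughly what the fix looks like. This is a sketch: the store paths, array name, and target chunks are hypothetical, and max_mem is given as a size string here.

import zarr
from rechunker import rechunk

source_group = zarr.open_group("source.zarr", mode="r")  # hypothetical source store

array_plan = rechunk(
    source_group,
    target_chunks={"my_array": (1000, 1000)},  # hypothetical per-array target chunks
    max_mem="2GB",                             # stay below Blosc's 2147483647-byte limit
    target_store="target.zarr",
    temp_store="temp.zarr",
)
array_plan.execute()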

read using chunks that are larger than its native chunks, up to the size specified by max_mem

Digging through the copy_spec, I can see now that it just creates read chunks that are as close as possible to max_mem. So when I had max_mem set to 16GB, the read chunk was the whole array, since the array is smaller than 16GB!

I think having access to copy_spec is very valuable, but it was very difficult to get to. Do you know of an easier way to get to it other than accessing the underlying private functions?

Thanks again for your insight! I really appreciate your time. :smile:

rabernat commented 3 years ago

I think having access to copy_spec is very valuable, but was very difficult to get to. Do you know of an easier way to get to this other than accessing the underlying private functions?

It would be great to put copy_spec into the Rechunked object. We can add copy_spec to the __init__ method:

https://github.com/pangeo-data/rechunker/blob/c59f303ea541e098406b2e225ed0cc73999e562f/rechunker/api.py#L46

...and pass it through here:

https://github.com/pangeo-data/rechunker/blob/c59f303ea541e098406b2e225ed0cc73999e562f/rechunker/api.py#L306
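A rough sketch of what that could look like (the constructor arguments shown here are simplified placeholders rather than the actual signature):

class Rechunked:
    def __init__(self, executor, plan, source, intermediate, target, copy_spec=None):
        self._executor = executor
        self._plan = plan
        self._source = source
        self._intermediate = intermediate
        self._target = target
        self._copy_spec = copy_spec  # keep the read/intermediate/write chunk plan around

    @property
    def copy_spec(self):
        """The copy specification(s) used to build this rechunking plan."""
        return self._copy_spec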

Would love to see a PR for this! 😉

rabernat commented 3 years ago

@lsetiawan - would be great if you were interested in adding copy_spec to the Rechunked object. (Would help in debugging #80 for example.) I'm going to close this, since the original issue seems resolved.