pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

Chunk too large for Blosc codec #98

Open aulemahal opened 3 years ago

aulemahal commented 3 years ago

Hi! Thanks for the very useful package! I think I found a bug in the chunk-choice mechanism:

My input dataset has shape (176, 226, 55115) with chunks (20, 20, 55115). The requested output chunks are (80, 60, 365). I set max_mem to 3 GB, and there is a temp store.
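
For context, the call is along these lines (just a sketch; the store paths are placeholders and the real invocation may differ):

import zarr
from rechunker import rechunk

# source array: shape (176, 226, 55115), chunks (20, 20, 55115), dtype float32
source = zarr.open('path/to/source.zarr')

plan = rechunk(
    source,
    target_chunks=(80, 60, 365),
    max_mem='3GB',
    target_store='path/to/target.zarr',
    temp_store='path/to/temp.zarr',
)
plan.execute()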

Rechunking fails with: (elided traceback)

  File "/path/to/.conda/x38/lib/python3.9/site-packages/distributed/client.py", line 1813, in _gather
    raise exception.with_traceback(traceback)
  File "/path/to/.conda/x38/lib/python3.9/site-packages/rechunker/pipeline.py", line 47, in _copy_chunk
    target[chunk_key] = data
  File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1213, in __setitem__
    self.set_basic_selection(selection, value, fields=fields)
  File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1308, in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
  File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1599, in _set_basic_selection_nd
    self._set_selection(indexer, value, fields=fields)
  File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1651, in _set_selection
    self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
  File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1888, in _chunk_setitem
    self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
  File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1893, in _chunk_setitem_nosync
    cdata = self._process_for_setitem(ckey, chunk_selection, value, fields=fields)
  File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 1952, in _process_for_setitem
    return self._encode_chunk(chunk)
  File "/path/to/.conda/x38/lib/python3.9/site-packages/zarr/core.py", line 2009, in _encode_chunk
    cdata = self._compressor.encode(chunk)
  File "numcodecs/blosc.pyx", line 557, in numcodecs.blosc.Blosc.encode
  File "/path/to/.conda/x38/lib/python3.9/site-packages/numcodecs/compat.py", line 102, in ensure_contiguous_ndarray
    raise ValueError(msg)
ValueError: Codec does not support buffers of > 2147483647 bytes

It turns out that 55115 × 176 × 226 = 2192254240 elements, which is about 8.8 GB as float32, and even the raw element count is slightly over the number in the error message (by about 2%). So I'm guessing rechunker is trying to put everything in a single chunk, even though this is way above max_mem? Also, I never asked for Blosc encoding, so I guess it is applied automatically? That is not a problem in itself, but it seems a smaller chunk size should be chosen in that case.
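
A quick check of the numbers (the 2147483647 in the error is 2**31 - 1, the largest buffer the Blosc codec accepts):

n_elements = 176 * 226 * 55115   # 2_192_254_240 elements
n_bytes = 4 * n_elements         # float32, so roughly 8.8 GB
blosc_limit = 2_147_483_647      # 2**31 - 1, the limit quoted in the error
print(n_elements / blosc_limit)  # ~1.02: even the element count is 2% over
print(n_bytes / blosc_limit)     # ~4.08: the byte count is about 4x over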

rabernat commented 3 years ago

Thanks for the bug report.

"So I'm guessing rechunker is trying to put everything in a single chunk?"

This should definitely not happen unless your total dataset size is < max_mem, which is not the case here.

I tried to reproduce your issue but could not:

import zarr
from dask.diagnostics import ProgressBar
from rechunker import rechunk

shape = (176, 226, 55115)
source = zarr.ones(shape, chunks=(20, 20, 55115), dtype='f8', store='tmp-data/source.zarr', overwrite=True)
rechunked = rechunk(source, (80, 60, 365), '3GB', 'tmp-data/target.zarr',
                    target_options=dict(overwrite=True))
assert rechunked._intermediate is None  # no intermediate

with ProgressBar():
    rechunked.execute()
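
For what it's worth, a variant closer to the reported setup (float32 data and an explicit temp store, reusing the imports above; the paths are illustrative) would look like this:

shape = (176, 226, 55115)
source = zarr.ones(shape, chunks=(20, 20, 55115), dtype='f4',
                   store='tmp-data/source.zarr', overwrite=True)
rechunked = rechunk(source, (80, 60, 365), '3GB', 'tmp-data/target.zarr',
                    temp_store='tmp-data/temp.zarr',
                    target_options=dict(overwrite=True))

with ProgressBar():
    rechunked.execute()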

Could you share a bit more detail about your input data and the exact code you are using to call rechunker?