pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License
162 stars 25 forks source link

TypeError when rechunking a group #118

Closed DamienIrving closed 2 years ago

DamienIrving commented 2 years ago

I'm rechunking a dataset that I ultimately want to open with xarray, so I was following the rechunk a group part of the online tutorial as follows (using rechunker version 0.5.0):

import xarray as xr
import zarr
from rechunker import rechunk

ds = xr.tutorial.open_dataset("air_temperature")
ds = ds.chunk({'time': 100})
ds.air.encoding = {}

ds.to_zarr('air_temperature.zarr')
source_group = zarr.open('air_temperature.zarr')
source_array = source_group['air']

target_chunks_dict = {
    'air': {'time': 2920, 'lat': 25, 'lon': 1},
    'time': None,
    'lon': None,
    'lat': None,
}
max_mem = '1MB'
target_store = 'air_rechunked.zarr'
temp_store = 'air_rechunked-tmp.zarr'
array_plan = rechunk(source_array, target_chunks_dict, max_mem, target_store, temp_store=temp_store)
TypeError                                 Traceback (most recent call last)
Input In [23], in <cell line: 10>()
      8 target_store = 'air_rechunked.zarr'
      9 temp_store = 'air_rechunked-tmp.zarr'
---> 10 array_plan = rechunk(source_array, target_chunks_dict, max_mem, target_store, temp_store=temp_store)

File /g/data/xv83/dbi599/miniconda3/envs/agcd/lib/python3.10/site-packages/rechunker/api.py:302, in rechunk(source, target_chunks, max_mem, target_store, target_options, temp_store, temp_options, executor)
    299 if isinstance(executor, str):
    300     executor = _get_executor(executor)
--> 302 copy_spec, intermediate, target = _setup_rechunk(
    303     source=source,
    304     target_chunks=target_chunks,
    305     max_mem=max_mem,
    306     target_store=target_store,
    307     target_options=target_options,
    308     temp_store=temp_store,
    309     temp_options=temp_options,
    310 )
    311 plan = executor.prepare_plan(copy_spec)
    312 return Rechunked(executor, plan, source, intermediate, target)

File /g/data/xv83/dbi599/miniconda3/envs/agcd/lib/python3.10/site-packages/rechunker/api.py:472, in _setup_rechunk(source, target_chunks, max_mem, target_store, target_options, temp_store, temp_options)
    468     return copy_specs, temp_group, target_group
    470 elif isinstance(source, (zarr.core.Array, dask.array.Array)):
--> 472     copy_spec = _setup_array_rechunk(
    473         source,
    474         target_chunks,
    475         max_mem,
    476         target_store,
    477         target_options=target_options,
    478         temp_store_or_group=temp_store,
    479         temp_options=temp_options,
    480     )
    481     intermediate = copy_spec.intermediate.array
    482     target = copy_spec.write.array

File /g/data/xv83/dbi599/miniconda3/envs/agcd/lib/python3.10/site-packages/rechunker/api.py:531, in _setup_array_rechunk(source_array, target_chunks, max_mem, target_store_or_group, target_options, temp_store_or_group, temp_options, name)
    529 # don't consolidate reads for Dask arrays
    530 consolidate_reads = isinstance(source_array, zarr.core.Array)
--> 531 read_chunks, int_chunks, write_chunks = rechunking_plan(
    532     shape,
    533     source_chunks,
    534     target_chunks,
    535     itemsize,
    536     max_mem,
    537     consolidate_reads=consolidate_reads,
    538 )
    540 # create target
    541 shape = tuple(int(x) for x in shape)  # ensure python ints for serialization

File /g/data/xv83/dbi599/miniconda3/envs/agcd/lib/python3.10/site-packages/rechunker/algorithm.py:117, in rechunking_plan(shape, source_chunks, target_chunks, itemsize, max_mem, consolidate_reads, consolidate_writes)
    114     raise ValueError(f"target_chunks {target_chunks} must have length {ndim}")
    116 source_chunk_mem = itemsize * prod(source_chunks)
--> 117 target_chunk_mem = itemsize * prod(target_chunks)
    119 if source_chunk_mem > max_mem:
    120     raise ValueError(
    121         f"Source chunk memory ({source_chunk_mem}) exceeds max_mem ({max_mem})"
    122     )

File /g/data/xv83/dbi599/miniconda3/envs/agcd/lib/python3.10/site-packages/rechunker/compat.py:11, in prod(iterable)
      8 try:
      9     from math import prod as mathprod  # type: ignore # Python 3.8
---> 11     return mathprod(iterable)
     12 except ImportError:
     13     return reduce(operator.mul, iterable, 1)

TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'

It looks like it isn't happy with the 'time': None, 'lon': None, 'lat': None in the target chunks dictionary?

rabernat commented 2 years ago

Thanks for reporting this Damian! I agree this looks like a bug. I am guessing the problem is that you do have the dimension specified as chunked here

'air': {'time': 2920, 'lat': 25, 'lon': 1},

but not here

    'time': None,
    'lon': None,
    'lat': None,

and that is what is confusing rechunker.

To work around this, for now you might try specifying the chunks as tuples (rather than dicts), i.e.

'air': (2920, 25, 1)  # assuming dim order is time, lat, lon 
rabernat commented 2 years ago

Ok my previous suggestion was a red herring. I think I found the issue. You want to rechunk a group. But you are passing an array to rechunk. Try

- array_plan = rechunk(source_array, target_chunks_dict, max_mem, target_store, temp_store=temp_store)
+ array_plan = rechunk(source_group, target_chunks_dict, max_mem, target_store, temp_store=temp_store)

That worked for me!

Here is the relevant line from the tutorial:

array_plan = rechunk(source_group, target_chunks, max_mem, target_store, temp_store=temp_store)

I suppose it is confusing to call it array_plan when actually we are rechunking a group.

DamienIrving commented 2 years ago

Thanks, @rabernat! That fixes it. My apologies - I should have spotted that typo.