pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

ContainsArrayError: path 'lon' contains an array #78

Closed pl-marasco closed 3 years ago

pl-marasco commented 3 years ago

A little bit of context:

I have a dataset of 768 NetCDF files, each stored as a single file; none of them is chunked (time: 1, lat: 15680, lon: 40320).

What I'm trying to achieve:

Issue: When I try to create the plan, I get this error: ContainsArrayError: path 'lon' contains an array

Gist: https://gist.github.com/pl-marasco/f6e1bf9f3a0f87ce028fc68735ab25fa

TomAugspurger commented 3 years ago

IIRC, this typically happens when there are already files at target_store or temp_store.

Can you clear those directories prior to calling rechunk?
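One way to rule this out is to remove both stores programmatically right before the call. The following is a minimal sketch, assuming both stores are local directory paths; the helper name `clear_store` and the two paths are placeholders, not part of rechunker.

```python
import shutil
from pathlib import Path


def clear_store(path):
    """Delete a local store directory if it exists; return True if something was removed."""
    store = Path(path)
    if store.exists():
        shutil.rmtree(store)
        return True
    return False


# Placeholder paths: substitute the real target_store and temp_store.
for store_path in ("./output.zarr", "./temp_store.zarr"):
    removed = clear_store(store_path)
    print(f"{store_path}: {'removed' if removed else 'nothing to remove'}")
```

Returning a boolean makes it easy to log whether anything was actually left over from a previous run.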

pl-marasco commented 3 years ago

Unfortunately, that isn't the case; the files are removed before calling rechunk.

rabernat commented 3 years ago

Could you post the full traceback of your error?

pl-marasco commented 3 years ago
---------------------------------------------------------------------------

ContainsArrayError                        Traceback (most recent call last)

<ipython-input-9-2e6f94a3cc43> in <module>
----> 1 array_plan = rechunk(ds, target_chunks, mem_max, target_store,temp_store=temp_store)
      2 

~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in rechunk(source, target_chunks, max_mem, target_store, target_options, temp_store, temp_options, executor)
    294             )
    295 
--> 296     copy_spec, intermediate, target = _setup_rechunk(
    297         source=source,
    298         target_chunks=target_chunks,

~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in _setup_rechunk(source, target_chunks, max_mem, target_store, target_options, temp_store, temp_options)
    373             variable_attrs[DIMENSION_KEY] = encode_zarr_attr_value(variable.dims)
    374 
--> 375             copy_spec = _setup_array_rechunk(
    376                 dask.array.asarray(variable),
    377                 variable_chunks,

~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in _setup_array_rechunk(source_array, target_chunks, max_mem, target_store_or_group, target_options, temp_store_or_group, temp_options, name)
    493     write_chunks = tuple(int(x) for x in write_chunks)
    494 
--> 495     target_array = _zarr_empty(
    496         shape,
    497         target_store_or_group,

~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in _zarr_empty(shape, store_or_group, chunks, dtype, name, **kwargs)
    149     if name is not None:
    150         assert isinstance(store_or_group, zarr.hierarchy.Group)
--> 151         return store_or_group.empty(
    152             name, shape=shape, chunks=chunks, dtype=dtype, **kwargs
    153         )

~\Anaconda3\envs\treotto_dev\lib\site-packages\zarr\hierarchy.py in empty(self, name, **kwargs)
    899         """Create an array. Keyword arguments as per
    900         :func:`zarr.creation.empty`."""
--> 901         return self._write_op(self._empty_nosync, name, **kwargs)
    902 
    903     def _empty_nosync(self, name, **kwargs):

~\Anaconda3\envs\treotto_dev\lib\site-packages\zarr\hierarchy.py in _write_op(self, f, *args, **kwargs)
    659 
    660         with lock:
--> 661             return f(*args, **kwargs)
    662 
    663     def create_group(self, name, overwrite=False):

~\Anaconda3\envs\treotto_dev\lib\site-packages\zarr\hierarchy.py in _empty_nosync(self, name, **kwargs)
    905         kwargs.setdefault('synchronizer', self._synchronizer)
    906         kwargs.setdefault('cache_attrs', self.attrs.cache)
--> 907         return empty(store=self._store, path=path, chunk_store=self._chunk_store,
    908                      **kwargs)
    909 

~\Anaconda3\envs\treotto_dev\lib\site-packages\zarr\creation.py in empty(shape, **kwargs)
    225 
    226     """
--> 227     return create(shape=shape, fill_value=None, **kwargs)
    228 
    229 

~\Anaconda3\envs\treotto_dev\lib\site-packages\zarr\creation.py in create(shape, chunks, dtype, compressor, fill_value, order, store, synchronizer, overwrite, path, chunk_store, filters, cache_metadata, cache_attrs, read_only, object_codec, **kwargs)
    119 
    120     # initialize array metadata
--> 121     init_array(store, shape=shape, chunks=chunks, dtype=dtype, compressor=compressor,
    122                fill_value=fill_value, order=order, overwrite=overwrite, path=path,
    123                chunk_store=chunk_store, filters=filters, object_codec=object_codec)

~\Anaconda3\envs\treotto_dev\lib\site-packages\zarr\storage.py in init_array(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec)
    342     _require_parent_group(path, store=store, chunk_store=chunk_store, overwrite=overwrite)
    343 
--> 344     _init_array_metadata(store, shape=shape, chunks=chunks, dtype=dtype,
    345                          compressor=compressor, fill_value=fill_value,
    346                          order=order, overwrite=overwrite, path=path,

~\Anaconda3\envs\treotto_dev\lib\site-packages\zarr\storage.py in _init_array_metadata(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec)
    371             rmdir(chunk_store, path)
    372     elif contains_array(store, path):
--> 373         raise ContainsArrayError(path)
    374     elif contains_group(store, path):
    375         raise ContainsGroupError(path)

ContainsArrayError: path 'lon' contains an array

pl-marasco commented 3 years ago

I don't know if this helps, but from time to time (I still have to work out exactly when it happens) I get this error too:

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-10-2e6f94a3cc43> in <module>
----> 1 array_plan = rechunk(ds, target_chunks, mem_max, target_store,temp_store=temp_store)
      2 

~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in rechunk(source, target_chunks, max_mem, target_store, target_options, temp_store, temp_options, executor)
    294             )
    295 
--> 296     copy_spec, intermediate, target = _setup_rechunk(
    297         source=source,
    298         target_chunks=target_chunks,

~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in _setup_rechunk(source, target_chunks, max_mem, target_store, target_options, temp_store, temp_options)
    373             variable_attrs[DIMENSION_KEY] = encode_zarr_attr_value(variable.dims)
    374 
--> 375             copy_spec = _setup_array_rechunk(
    376                 dask.array.asarray(variable),
    377                 variable_chunks,

~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in _setup_array_rechunk(source_array, target_chunks, max_mem, target_store_or_group, target_options, temp_store_or_group, temp_options, name)
    464 
    465     if isinstance(target_chunks, dict):
--> 466         array_dims = _get_dims_from_zarr_array(source_array)
    467         try:
    468             target_chunks = _shape_dict_to_tuple(array_dims, target_chunks)

~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in _get_dims_from_zarr_array(z_array)
    138     # use Xarray convention
    139     # http://xarray.pydata.org/en/stable/internals.html#zarr-encoding-specification
--> 140     return z_array.attrs["_ARRAY_DIMENSIONS"]
    141 
    142 

AttributeError: 'Array' object has no attribute 'attrs'

rabernat commented 3 years ago

So the traceback definitely suggests that Zarr thinks there is already an array at the location of the target store. Just to completely rule this out, could you add something like this to your code just before calling rechunk:

import os
print(os.listdir(target_store))

rabernat commented 3 years ago

The second error you posted should only be possible if you are rechunking from a Zarr array source (not an Xarray dataset). Does it arise from the same code you shared via gist above?

I'm confused by the fact that you are reporting two distinct errors in the same issue. For the same code, do you always get the same error? Or does it vary at random?
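For the directory check suggested earlier, note that `os.listdir` raises `FileNotFoundError` when the store has not been created yet, which is itself evidence that nothing is there. A small sketch that separates the three cases (missing, empty, populated), assuming a local directory store; `describe_store` and the path are hypothetical:

```python
import os


def describe_store(path):
    """Return the sorted entries of a local store directory, or None if it does not exist."""
    if not os.path.isdir(path):
        return None
    return sorted(os.listdir(path))


entries = describe_store("./output.zarr")  # placeholder path
if entries is None:
    print("store does not exist yet")
elif not entries:
    print("store exists but is empty")
else:
    print("store already contains:", entries)
```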

pl-marasco commented 3 years ago

Yes, it comes from the same code. I posted the second one because the two errors appear (forgive the term) randomly. I tried removing the attributes and adding this line to target_chunks:

'attrs': None

I still don't have a solution, and it jumps from one error to the other for no reason I can understand.

About whether the target store is empty:

print(os.listdir(target_store))

FileNotFoundError                         Traceback (most recent call last)

in
----> 1 print(os.listdir(target_store))
      2

FileNotFoundError: [WinError 3] Impossibile trovare il percorso specificato (The system cannot find the path specified): 'c:/data/tmp/NDVI_GLOBAL.zarr'

If you need some files to run tests, you can download them from here: https://land.copernicus.vgt.vito.be/manifest/ndvi_v2_1km/manifest_cgls_ndvi_v2_1km_latest.txt

rabernat commented 3 years ago

I'm sorry for your frustration. This is extremely puzzling to me. In particular, the randomness / intermittency of the problem makes it very hard to debug.

Guessing has not worked, so what we will need to do is try to craft a minimal reproducible bug report which can reproduce the same errors, ideally without using your many TB of actual data, but rather with synthetic data that are small and simple.

Could you share the full output of print(ds)?

pl-marasco commented 3 years ago

Let's try to solve the second problem first; eventually I may be able to reproduce the first one more reliably.

Here is a more structured minimal reproducible bug report (MRBR):

import xarray as xr
from rechunker import rechunk

ds = xr.tutorial.load_dataset("rasm")

target_chunks = {
        'Tair': {'time': 36, 'lat': 50, 'lon': 50},
        'time': None,
        'lat': None,
        'lon': None}

mem_max = '8GB'
target_store = './output.zarr'
temp_store = './temp_store.zarr'
! rm -rf ./*.zarr

array_plan = rechunk(ds, target_chunks, mem_max, target_store,temp_store=temp_store)

Traceback:

AttributeError                            Traceback (most recent call last)

in
----> 1 array_plan = rechunk(ds, target_chunks, mem_max, target_store,temp_store=temp_store)

~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in rechunk(source, target_chunks, max_mem, target_store, target_options, temp_store, temp_options, executor)
    294             )
    295 
--> 296     copy_spec, intermediate, target = _setup_rechunk(
    297         source=source,
    298         target_chunks=target_chunks,

~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in _setup_rechunk(source, target_chunks, max_mem, target_store, target_options, temp_store, temp_options)
    373             variable_attrs[DIMENSION_KEY] = encode_zarr_attr_value(variable.dims)
    374 
--> 375             copy_spec = _setup_array_rechunk(
    376                 dask.array.asarray(variable),
    377                 variable_chunks,

~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in _setup_array_rechunk(source_array, target_chunks, max_mem, target_store_or_group, target_options, temp_store_or_group, temp_options, name)
    464 
    465     if isinstance(target_chunks, dict):
--> 466         array_dims = _get_dims_from_zarr_array(source_array)
    467         try:
    468             target_chunks = _shape_dict_to_tuple(array_dims, target_chunks)

~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in _get_dims_from_zarr_array(z_array)
    138     # use Xarray convention
    139     # http://xarray.pydata.org/en/stable/internals.html#zarr-encoding-specification
--> 140     return z_array.attrs["_ARRAY_DIMENSIONS"]
    141 
    142 

AttributeError: 'Array' object has no attribute 'attrs'

The wrong assumption seems to be the presence of _ARRAY_DIMENSIONS; since this input isn't an Xarray dataset converted to Zarr, no such attribute is defined and the call fails. I also tested converting to a .zarr store and reading it back in, but that doesn't seem to fix the problem.
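The failing line in the traceback reads `z_array.attrs["_ARRAY_DIMENSIONS"]`, which raises `AttributeError` for any object that has no `attrs`. Purely as an illustration, a defensive lookup might look like the sketch below; `get_dims`, `ZarrLike`, and `BareArray` are hypothetical stand-ins and not rechunker code.

```python
def get_dims(array):
    """Return dimension names from `.dims` or the `_ARRAY_DIMENSIONS` attribute, else None."""
    dims = getattr(array, "dims", None)
    if dims is not None:
        return tuple(dims)
    attrs = getattr(array, "attrs", None)
    if attrs is not None and "_ARRAY_DIMENSIONS" in attrs:
        return tuple(attrs["_ARRAY_DIMENSIONS"])
    return None


class ZarrLike:
    """Stand-in for an array written by Xarray to Zarr (carries attrs)."""
    attrs = {"_ARRAY_DIMENSIONS": ["time", "y", "x"]}


class BareArray:
    """Stand-in for an array object without an `attrs` attribute."""


print(get_dims(ZarrLike()))   # ('time', 'y', 'x')
print(get_dims(BareArray()))  # None
```

Returning `None` instead of raising lets the caller decide whether to fall back to positional (tuple) chunks.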

rabernat commented 3 years ago

Your example did not work for me either, but it failed in a different way:

import xarray as xr
from rechunker import rechunk

ds = xr.tutorial.load_dataset("rasm")

target_chunks = {
        'Tair': {'time': 36, 'lat': 50, 'lon': 50},
        'time': None,
        'lat': None,
        'lon': None}

mem_max = '8GB'
target_store = './output.zarr'
temp_store = './temp_store.zarr'
! rm -rf ./*.zarr

array_plan = rechunk(ds, target_chunks, mem_max, target_store,temp_store=temp_store)

I get KeyError: 'y'. The problem is that lon and lat are not dimensions on this dataset.

If I change target_chunks as follows

target_chunks = {
        'Tair': {'time': 36, 'y': 50, 'x': 50},
        'time': None,
        'lat': None,
        'lon': None}

...then the example runs with no error in my dev environment.
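A mismatch between the chunk keys and a variable's dimension names can be caught before planning. This is a minimal sketch, assuming Tair has dimensions (time, y, x) as the corrected chunks imply; `check_chunk_dims` is a hypothetical helper, not a rechunker function. With a real dataset the names would come from `ds["Tair"].dims`.

```python
def check_chunk_dims(dims, chunks):
    """Compare a per-variable chunk dict against the variable's dimension names.

    Returns (missing, unknown): dims without a chunk size, and chunk keys
    that are not dimensions of the variable.
    """
    missing = [d for d in dims if d not in chunks]
    unknown = [k for k in chunks if k not in dims]
    return missing, unknown


tair_dims = ("time", "y", "x")  # assumed; read from ds["Tair"].dims in practice

print(check_chunk_dims(tair_dims, {"time": 36, "lat": 50, "lon": 50}))
# (['y', 'x'], ['lat', 'lon'])
print(check_chunk_dims(tair_dims, {"time": 36, "y": 50, "x": 50}))
# ([], [])
```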

So somehow, in your environment, it does not realize that the input is an xarray dataset.

Can you share your rechunker version?

import rechunker
rechunker.__version__

rabernat commented 3 years ago

Ok I have determined that this is a version of the bug in https://github.com/pangeo-data/rechunker/issues/59#issuecomment-718840912. It was fixed in #72. Basically, in your version, you can only specify chunks as a tuple, i.e. 'Tair': (36, 50, 50) not 'Tair': {'time': 36, 'y': 50, 'x': 50}.
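For anyone stuck on an older version, the dict form can be converted to the tuple form by ordering the sizes to match the variable's dimensions. A minimal sketch, assuming the dimension order (time, y, x); `chunks_dict_to_tuple` is a hypothetical helper written for this example:

```python
def chunks_dict_to_tuple(dims, chunks):
    """Order per-dimension chunk sizes to match `dims`; raises KeyError on a missing dim."""
    return tuple(chunks[d] for d in dims)


tair_dims = ("time", "y", "x")  # assumed; read from ds["Tair"].dims in practice
tair_chunks = chunks_dict_to_tuple(tair_dims, {"time": 36, "y": 50, "x": 50})
print(tair_chunks)  # (36, 50, 50)

# Same layout as the target_chunks used in this thread, with Tair in tuple form.
target_chunks = {"Tair": tair_chunks, "time": None, "lat": None, "lon": None}
```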

I just released v0.3.3 to pypi, so you could try upgrading to see if that fixes your problem.
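To confirm whether an environment already has the fixed release, the installed version can be compared against 0.3.3. The sketch below uses the standard-library `importlib.metadata`; `version_tuple` and `has_fix` are hypothetical helpers, and the parsing is deliberately simplified (it looks only at the leading numeric components, so pre-release suffixes are ignored).

```python
import re
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Parse the leading numeric components of a version string, e.g. '0.3.3' -> (0, 3, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no numeric version found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def has_fix(installed, minimum="0.3.3"):
    """Return True if `installed` is at least `minimum`."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = version("rechunker")
    print(installed, "includes the fix" if has_fix(installed) else "predates v0.3.3")
except PackageNotFoundError:
    print("rechunker is not installed in this environment")
```

If the check reports an older version, `pip install --upgrade rechunker` should pull in the newer release.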

pl-marasco commented 3 years ago

Solved! It also works with the original dataset I was using. Now I'm able to create the plan. Thanks!