pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Zarr chunks would overlap multiple dask chunks error #5286

Closed eric-czech closed 3 years ago

eric-czech commented 3 years ago

Would it be possible to get an explanation on how this situation results in a zarr chunk overlapping multiple dask chunks?

The code below generates an array with 2 chunks, selects one row from each chunk, and then writes the resulting two-row array back to zarr. I don't see how it's possible in this case for one zarr chunk to correspond to different dask chunks. There are clearly two resulting dask chunks, two input zarr chunks, and what should be a 1-to-1 correspondence between them ... so what does this error message actually mean?

import xarray as xr
import dask.array as da

ds = xr.Dataset(dict(
    x=(('a', 'b'), da.ones(shape=(10, 10), chunks=(5, 10))),
)).assign(a=list(range(10)))
ds
# <xarray.Dataset>
# Dimensions:  (a: 10, b: 10)
# Coordinates:
#   * a        (a) int64 0 1 2 3 4 5 6 7 8 9
# Dimensions without coordinates: b
# Data variables:
#     x        (a, b) float64 dask.array<chunksize=(5, 10), meta=np.ndarray>

# Write the dataset out
!rm -rf /tmp/test.zarr
ds.to_zarr('/tmp/test.zarr')

# Read it back in, subset to 1 record in two different chunks (two rows total), write back out
!rm -rf /tmp/test2.zarr
xr.open_zarr('/tmp/test.zarr').sel(a=[0, 5]).to_zarr('/tmp/test2.zarr')
# NotImplementedError: Specified zarr chunks encoding['chunks']=(5, 10) for variable named 'x' would overlap multiple dask chunks ((1, 1), (10,)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.
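For reference, inspecting the selection shows what the check is comparing (a quick sketch, assuming the same /tmp/test.zarr store written above): the dask chunks of the selection are now one row each, while the zarr chunk size inherited in the encoding is still 5 rows.

# Inspect the chunking that the safe-chunks check compares
sub = xr.open_zarr('/tmp/test.zarr').sel(a=[0, 5])
print(sub.x.data.chunks)         # ((1, 1), (10,)) -> two 1-row dask chunks
print(sub.x.encoding['chunks'])  # (5, 10) -> zarr chunks inherited from the source store
# The first 5-row zarr chunk would be written by both 1-row dask chunks,
# which is what triggers the "would overlap multiple dask chunks" error.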

Also, what is the difference between "deleting or modifying encoding['chunks']" and "specifying safe_chunks=False"? That wasn't clear to me from https://github.com/pydata/xarray/issues/5056.

Lastly, and most importantly, can data be corrupted by parallel zarr writes if encoding['chunks'] is simply deleted in these situations?

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-16-cloud-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.18.0
pandas: 1.2.4
numpy: 1.20.2
scipy: 1.6.3
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.1
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.04.1
distributed: 2021.04.1
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.1.1
conda: None
pytest: 6.2.4
IPython: 7.23.1
sphinx: None
shoyer commented 3 years ago

The short answer is that you should delete encoding['chunks'] here for now. This operation is totally safe, but Xarray is mistakenly trying to re-use the chunks from the source dataset when writing the data to disk.

The long answer is that we really should fix this in Xarray. See https://github.com/pydata/xarray/issues/5219 for discussion.
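To make the two options concrete, here is a minimal sketch of both workarounds named in the error message (the variable names, output paths, and mode='w' overwrite are just for illustration; it assumes the /tmp/test.zarr store from the reproduction above):

import xarray as xr

sub = xr.open_zarr('/tmp/test.zarr').sel(a=[0, 5])

# Option 1 (recommended above): drop the inherited zarr chunk encoding so that
# xarray derives the on-disk chunks from the dask chunks of the selection.
del sub.x.encoding['chunks']
sub.to_zarr('/tmp/test2.zarr', mode='w')

# Option 2: keep the inherited encoding but skip the safety check entirely.
# This disables the guard against genuinely unsafe parallel writes, so it is
# a blunter tool than deleting the encoding for just this variable.
sub2 = xr.open_zarr('/tmp/test.zarr').sel(a=[0, 5])
sub2.to_zarr('/tmp/test3.zarr', mode='w', safe_chunks=False)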

eric-czech commented 3 years ago

Thanks @shoyer, good to know!

max-sixty commented 3 years ago

I'll close this now that it's linked to #5219. Thanks @eric-czech & @shoyer