pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

to_zarr fails for large dimensions; sensitive to exact dimension size and chunk size #6640

Closed JamiePringle closed 1 year ago

JamiePringle commented 2 years ago

What happened?

Using dask 2022.05.0, zarr 2.11.3, and xarray 2022.3.0: when I create a large empty dataset and try to save it in the zarr format with to_zarr, it fails with the following error. Frankly, I am not sure whether the problem is with xarray or zarr, but as documented in the attached code, when I create the same dataset directly with zarr, it works just fine.

File ~/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/lib/python3.9/site-packages/zarr/core.py:2101, in Array._decode_chunk(self, cdata, start, nitems, expected_shape)
   2099 # ensure correct chunk shape
   2100 chunk = chunk.reshape(-1, order='A')
-> 2101 chunk = chunk.reshape(expected_shape or self._chunks, order=self._order)
   2103 return chunk

ValueError: cannot reshape array of size 234506 into shape (235150,)

To show that this is not a zarr issue, I have created the same output directly with zarr in the example code below; see the "else" clause in the code.

Note well: I have included a value of numberOfDrifters that has the problem, and one that does not. Please see the comments where numberOfDrifters is defined.

What did you expect to happen?

I expected a zarr dataset to be created. I cannot work around the problem with a chunk size of 1 because of memory constraints. I would prefer to create the zarr dataset with xarray so that it carries the metadata needed to load it back into xarray easily.

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr
import dask.array  # 'import dask' alone does not make dask.array available
import zarr

dtype = np.float32
chunkSize = 10000
maxNumObs = 1
#numberOfDrifters=120396431 #2008 This size WORKS
numberOfDrifters=120067029 #2007 This size FAILS

#if True, make zarr with xarray
if True:
    #make xarray data set, then write to zarr
    coords = {'traj': (['traj'], np.arange(numberOfDrifters)), 'obs': (['obs'], np.arange(maxNumObs))}
    emptyArray = dask.array.empty(shape=(numberOfDrifters, maxNumObs), dtype=dtype, chunks=(chunkSize, maxNumObs))
    var='time'
    data_vars={}
    attrs={}
    data_vars[var]=(['traj','obs'],emptyArray,attrs)
    dataOut=xr.Dataset(data_vars,coords,{})
    print('done defining data set, now writing')

    #now save to zarr dataset
    dataOut.to_zarr('dataPaths/jnk_makeWithXarray.zarr', mode='w')
    print('done writing')
    zarrInXarray = zarr.open('dataPaths/jnk_makeWithXarray.zarr', mode='r')
    print('done opening')
else:
    #make with zarr
    store=zarr.DirectoryStore('dataPaths/jnk_makeWithZarr.zarr')
    root=zarr.group(store=store)
    root.empty(shape=(numberOfDrifters,maxNumObs),name='time',dtype=dtype,chunks=(chunkSize,maxNumObs))
    print('done writing')
    zarrInZarr = zarr.open('dataPaths/jnk_makeWithZarr.zarr', mode='r')
    print('done opening')

MVCE confirmation

Relevant log output

Traceback (most recent call last):
  File "/data/plumehome/pringle/workfiles/oceanparcels/makeCommunityConnectivity/breakXarray.py", line 26, in <module>
    dataOut.to_zarr('dataPaths/jnk_makeWithXarray.zarr','w')
  File "/home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/lib/python3.9/site-packages/xarray/core/dataset.py", line 2036, in to_zarr
    return to_zarr(
  File "/home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/lib/python3.9/site-packages/xarray/backends/api.py", line 1431, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "/home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/lib/python3.9/site-packages/xarray/backends/api.py", line 1119, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/lib/python3.9/site-packages/xarray/backends/zarr.py", line 534, in store
    self.set_variables(
  File "/home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/lib/python3.9/site-packages/xarray/backends/zarr.py", line 613, in set_variables
    writer.add(v.data, zarr_array, region)
  File "/home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/lib/python3.9/site-packages/xarray/backends/common.py", line 154, in add
    target[region] = source
  File "/home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/lib/python3.9/site-packages/zarr/core.py", line 1285, in __setitem__
    self.set_basic_selection(pure_selection, value, fields=fields)
  File "/home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/lib/python3.9/site-packages/zarr/core.py", line 1380, in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
  File "/home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/lib/python3.9/site-packages/zarr/core.py", line 1680, in _set_basic_selection_nd
    self._set_selection(indexer, value, fields=fields)
  File "/home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/lib/python3.9/site-packages/zarr/core.py", line 1732, in _set_selection
    self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
  File "/home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/lib/python3.9/site-packages/zarr/core.py", line 1994, in _chunk_setitem
    self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
  File "/home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/lib/python3.9/site-packages/zarr/core.py", line 1999, in _chunk_setitem_nosync
    cdata = self._process_for_setitem(ckey, chunk_selection, value, fields=fields)
  File "/home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/lib/python3.9/site-packages/zarr/core.py", line 2049, in _process_for_setitem
    chunk = self._decode_chunk(cdata)
  File "/home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/lib/python3.9/site-packages/zarr/core.py", line 2101, in _decode_chunk
    chunk = chunk.reshape(expected_shape or self._chunks, order=self._order)
ValueError: cannot reshape array of size 234506 into shape (235150,)

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:22:55) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 5.13.0-41-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 2022.3.0
pandas: 1.4.1
numpy: 1.20.3
scipy: 1.8.0
netCDF4: 1.5.8
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.11.3
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.05.0
distributed: 2022.5.0
matplotlib: 3.5.1
cartopy: 0.20.2
seaborn: None
numbagg: None
fsspec: 2022.02.0
cupy: None
pint: None
sparse: None
setuptools: 61.2.0
pip: 22.0.4
conda: None
pytest: 7.1.1
IPython: 8.2.0
sphinx: None
dcherian commented 2 years ago

Thanks @JamiePringle. It works for me with maxNumObs=1 (I assume the problem is not sensitive to this?) and

dask  : 2022.3.0
zarr  : 2.10.3
numpy : 1.21.5
xarray: main

Can you try downgrading zarr to 2.10.3? Does it still fail for you with maxNumObs=1? I don't see a commit that would've fixed this on xarray main... And I don't think dask is involved here.

JamiePringle commented 2 years ago

My apologies @dcherian, in commenting the code, I switched "FAILS" and "WORKS" -- the size that fails is numberOfDrifters=120067029

I have edited the example code above, and it should fail when run. I have made a test environment with the versions you suggested, and with maxNumObs=1, and it still fails with the same error.

Jamie

dcherian commented 2 years ago

Thanks Jamie.

Yes it now fails here with maxNumObs=1 and xarray main.

It looks like self._chunks is wrong but I don't know why.

JamiePringle commented 2 years ago

I have had a few other odd indexing issues with large arrays. It almost feels as if, somewhere, the sizes are being forced into a fixed-size integer type.
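Editor's illustration of the kind of silent corruption being speculated about here (this is not a confirmed diagnosis of the bug, just what happens when a large size passes through a too-narrow numeric type):

```python
import numpy as np

n = 120067029  # the failing size from the example above

# float32 has a 24-bit significand, so integers this large are rounded to
# the nearest multiple of 8 and are no longer represented exactly:
print(int(np.float32(n)))  # 120067032, not 120067029

# int32 arithmetic in numpy arrays wraps around silently past 2**31 - 1:
big = np.array([2**31 - 1], dtype=np.int32)
print(int((big + 1)[0]))  # -2147483648
```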


max-sixty commented 1 year ago

This seems to pass on latest! So I'll close, but please reopen if there's still a failure on a different system.