Open chrisroat opened 3 years ago
~Currently xarray requires known dimension sizes. Unless anyone has any insight about its interaction with dask that I'm not familiar with?~ Edit: better informed views below
This also came up in #4659 and dask/dask#6058. In #4659 we settled for computing the chunk sizes for now, since supporting unknown chunk sizes seems like a bigger change.
There seems to be some support, but now you have me worried. I have used xarray mainly for labelling, not for much computation -- I'm dropping into dask because I need map_overlap.
FWIW, calling `dask.compute(arr)` works with unknown chunk sizes, but now I see `arr.compute()` does not. This fooled me into thinking I could use unknown chunk sizes. Now I see that writing to zarr does not work, either. This might torpedo my current design.
I see the `compute_chunk_sizes` method, but that seems to trigger computation. I'm running on a dask cluster -- is there anything I can do to salvage the pattern `arr_with_nan_shape.to_dataset().to_zarr(compute=False)` (with or without xarray)?
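For readers unfamiliar with how unknown chunk sizes arise, here is a minimal dask-only sketch (no xarray involved). It assumes a toy array and a boolean mask, neither of which comes from the thread; it only illustrates that plain `dask.compute` does not need the sizes, while `compute_chunk_sizes()` has to run the graph to fill them in.

```python
import math

import dask
import dask.array as da
import numpy as np

# Toy input (an assumption for illustration): two chunks of five elements.
x = da.from_array(np.arange(10), chunks=5)

# Boolean masking makes the result length data-dependent, so dask
# records the shape and chunk sizes as nan until something is computed.
evens = x[x % 2 == 0]
shape_is_unknown = math.isnan(evens.shape[0])

# Plain dask computation does not need the sizes up front.
(result,) = dask.compute(evens)

# compute_chunk_sizes() executes the graph to measure each chunk and
# updates the chunk metadata on the array in place.
evens.compute_chunk_sizes()
known_chunks = evens.chunks
```

After the last call the array has concrete chunk sizes, at the cost of one pass over the data, which is the trade-off raised in the comment above.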
I'm not sure about writing to zarr but it seems possible to support nan-sized dimensions when unindexed. We could skip alignment when the dimension is nan-sized for all variables in an Xarray object.
~/kitchen_sync/xarray/xarray/core/alignment.py in align(join, copy, indexes, exclude, fill_value, *objects)
283 for dim in obj.dims:
284 if dim not in exclude:
--> 285 all_coords[dim].append(obj.coords[dim])
286 try:
287 index = obj.indexes[dim]
For alignment, it may be as easy as adding the name of the nan-sized dimension to `exclude`.
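As a rough sketch of that idea, the helper below picks out the dimensions whose size is nan in every object, which is the set one might pass as `exclude` to `align`. The function name `unknown_size_dims` and the plain `{dim: size}` mappings are hypothetical stand-ins, not xarray API, and whether `align` would then succeed is untested here.

```python
import math


def unknown_size_dims(*dim_sizes):
    """Return dims whose size is nan in every mapping that contains them.

    Each argument is a plain ``{dim_name: size}`` mapping standing in for
    the sizes of one object (hypothetical helper, not part of xarray).
    """
    all_nan_so_far = {}
    for sizes in dim_sizes:
        for dim, size in sizes.items():
            is_nan = isinstance(size, float) and math.isnan(size)
            # A dim stays a candidate only while every size seen is nan.
            all_nan_so_far[dim] = all_nan_so_far.get(dim, True) and is_nan
    return {dim for dim, all_nan in all_nan_so_far.items() if all_nan}


# Two objects sharing a nan-sized dimension "z" (made-up sizes).
a = {"z": float("nan"), "y": 4}
b = {"z": float("nan"), "x": 3}
exclude = unknown_size_dims(a, b)
```

Requiring nan in every object is deliberate: if one object knows the size of a dimension, skipping alignment on it would hide a possible mismatch.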
It may run even deeper -- there seem to be several checks on dimension sizes that would need special casing. Even simply doing a variable[dim] lookup fails!
Related: #2801
What happened:
When creating a dataset from two variables with a common dimension, there is a TypeError thrown when that dimension has shape nan.
What you expected to happen:
A dataset should be created. I believe dask has an `allow_unknown_chunksizes` parameter for cases like this -- would that be something that could work here? (Assuming I'm not making a mistake myself.)
Minimal Complete Verifiable Example:
stack trace
```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/kitchen_sync/xarray/xarray/core/dataarray.py in _getitem_coord(self, key)
    692         try:
--> 693             var = self._coords[key]
    694         except KeyError:

KeyError: 'z'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
```
Anything else we need to know?:
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:12:38) [Clang 11.0.1 ]
python-bits: 64
OS: Darwin
OS-release: 20.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4
xarray: 0.17.1.dev66+g18ed29e4
pandas: 1.2.4
numpy: 1.20.2
scipy: 1.6.2
netCDF4: 1.5.6
pydap: installed
h5netcdf: 0.10.0
h5py: 3.1.0
Nio: None
zarr: 2.7.0
cftime: 1.4.1
nc_time_axis: 1.2.0
PseudoNetCDF: installed
rasterio: None
cfgrib: 0.9.9.0
iris: 2.4.0
bottleneck: 1.3.2
dask: 2021.04.0
distributed: 2021.04.0
matplotlib: 3.4.1
cartopy: 0.18.0
seaborn: 0.11.1
numbagg: installed
pint: 0.17
setuptools: 49.6.0.post20210108
pip: 20.2.4
conda: None
pytest: 6.2.3
IPython: 7.22.0
sphinx: None