pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.64k stars 1.09k forks source link

Chunking causes unrelated non-dimension coordinate to become a dask array #4205

Open chrisroat opened 4 years ago

chrisroat commented 4 years ago

What happened:

Rechunking along an independent dimension causes unrelated non-dimension coordinates to become dask arrays. The dimension coordinates do not seem affected.

I can stick in a synchronous compute on the coordinate to recover, but wanted to be sure this was the expected behavior.

What you expected to happen:

Chunking along an unrelated dimension should not affect unrelated non-dimension coordinates.

Minimal Complete Verifiable Example:

import xarray as xr
import dask.array as da

def print_coords(a, title):
    print()
    print(title)
    for dim in ['x', 'y', 'b']:
        if dim in a.dims or dim in a.coords:
            print('dim:', dim, 'type:', type(a.coords[dim].data))

arr = xr.DataArray(da.zeros((20, 20), chunks=10), dims=('x', 'y'), 
                   coords={'b': ('y', range(100,120)), 
                           'x': range(20), 
                           'y': range(20)})

print_coords(arr, 'Original')

# The following line rechunks independently of b or y.
# Removing this line allows the code to succeed.
arr = arr.chunk({'x': 5})

print_coords(arr, 'After chunking')

arr = arr.sel(y=2)

print_coords(arr, 'After selection')

print()
print('Scalar values:')
print('y=', arr.coords['y'].item())
print('b=', arr.coords['b'].item())  # Sad Panda
Original
dim: x type: <class 'numpy.ndarray'>
dim: y type: <class 'numpy.ndarray'>
dim: b type: <class 'numpy.ndarray'>

After chunking
dim: x type: <class 'numpy.ndarray'>
dim: y type: <class 'numpy.ndarray'>
dim: b type: <class 'dask.array.core.Array'>

After selection
dim: x type: <class 'numpy.ndarray'>
dim: y type: <class 'numpy.ndarray'>
dim: b type: <class 'dask.array.core.Array'>

Scalar values:
y= 2

<stack trace elided>
NotImplementedError: 'item' is not yet a valid method on dask arrays

Environment:

Output of xr.show_versions() INSTALLED VERSIONS ------------------ commit: None python: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0] python-bits: 64 OS: Linux OS-release: 4.19.112+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: None xarray: 0.15.1 pandas: 1.0.5 numpy: 1.18.5 scipy: 1.4.1 netCDF4: None pydap: None h5netcdf: None h5py: 2.10.0 Nio: None zarr: 2.4.0 cftime: None nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2.19.0 distributed: 2.19.0 matplotlib: 3.2.2 cartopy: None seaborn: None numbagg: None setuptools: 49.1.0.post20200704 pip: 20.1.1 conda: 4.8.3 pytest: 5.4.3 IPython: 7.16.1 sphinx: None
aulemahal commented 2 years ago

I have the same problem in xarray 2022.3.0. The issue is that this creates unnecessary dask tasks in the graph and some operations acting on the coordinates unexpectedly trigger dask computations. "Unexpected" because the coordinates at the beginning of the process where not chunked. So computation that was expected to happen in the main thread (or not happen at all) is now happenning in the dask workers.

An example:

import numpy as np
import xarray as xr
from dask.diagnostics import ProgressBar

# A 2D variable
da = xr.DataArray(
    np.ones((12, 10)),
    dims=('x', 'y'),
    coords={'x': np.arange(12), 'y': np.arange(10)}
 )

# A 1D variable sharing a dim with da
db = xr.DataArray(
    np.ones((12,)),
    dims=('x'),
    coords={'x': np.arange(12)}
)

# A non-dimension coordinate
cx = xr.DataArray(np.zeros((12,)), dims=('x',), coords={'x': np.arange(12)})

# Assign it to da and db
da = da.assign_coords(cx=cx)
db = db.assign_coords(cx=cx)

# We need to chunk along y
da = da.chunk({'y': 1})

# Notice how `cx` is now a dask array, even if it is a 1D coordinate and does not have 'Y' as a dimension.
print(da)

# This triggers a dask computation
with ProgressBar():
    da - db

The reason my example triggers dask is that xarray ensure the coordinates are aligned and equal (I think?). Anyway, I didn't expect it.

Personally, I think the chunk method shouldn't apply to the coordinates at all, no matter their dimensions. They're coordinate so we expect to be able to read them easily when aligning/comparing dataset. Dask is to be used with the "real" data only. Does this vision fit the one from the devs? I feel this "skip" could be easily implemented.

dcherian commented 2 years ago

It makes sense to me that chunking along a dimension dim should not chunk variables that don't have that dimension.

@shoyer what do you think