pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

preserve chunked data when creating DataArray from itself #5983

Closed FabianHofmann closed 2 years ago

FabianHofmann commented 2 years ago

What happened:

When creating a new DataArray from a DataArray with chunked data, the underlying dask array is converted to a numpy array.

What you expected to happen:

I expected the underlying dask array to be preserved when creating a new DataArray instance.

Minimal Complete Verifiable Example:

import xarray as xr
import numpy as np
from dask import array

d = np.ones((10, 10))
x = array.from_array(d, chunks=5)

da = xr.DataArray(x) # this is chunked
xr.DataArray(da) # this is not chunked anymore
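One way to check whether a DataArray is still lazy is to look at the type of its `.data` attribute and at `.chunks`. The sketch below assumes dask is installed; `is_lazy` is a hypothetical helper name, not part of xarray. The result of the last line depends on the installed xarray version, so no output is shown for it.

```python
import dask.array as dsa
import numpy as np
import xarray as xr


def is_lazy(arr):
    """Return True if the DataArray is backed by a dask array (hypothetical helper)."""
    return isinstance(arr.data, dsa.Array)


da = xr.DataArray(dsa.from_array(np.ones((10, 10)), chunks=5))

print(is_lazy(da))  # True
print(da.chunks)    # ((5, 5), (5, 5))

# Whether this stays lazy is exactly what the issue is about;
# the answer depends on the xarray version in use.
print(is_lazy(xr.DataArray(da)))
```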

Anything else we need to know?:

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.11.0-40-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.19.0
pandas: 1.3.3
numpy: 1.20.3
scipy: 1.7.1
netCDF4: 1.5.6
pydap: None
h5netcdf: 0.11.0
h5py: 3.2.1
Nio: None
zarr: 2.10.1
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.6
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.09.1
distributed: 2021.09.1
matplotlib: 3.4.3
cartopy: 0.19.0.post1
seaborn: 0.11.2
numbagg: None
pint: None
setuptools: 58.0.4
pip: 21.2.4
conda: 4.10.3
pytest: 6.2.5
IPython: 7.27.0
sphinx: 4.2.0

dcherian commented 2 years ago

Can you give us a little more context about why this might be useful? IIRC we disallowed creating DataArrays from DataArrays in some other place because it leads to ambiguous situations like the following:

xr.DataArray(da, attrs={"a": 1})  # does the result have da.attrs or the provided attrs?

FabianHofmann commented 2 years ago

Ah yes, this is indeed ambiguous. On the other hand, as long as creating DataArrays from DataArrays is still supported, it should at least preserve the data format. I need this because I am creating a subclass of xarray.DataArray (see https://github.com/PyPSA/linopy/blob/8ac34d9fdbddc1fec0c7b4781f3d49e9c5ae064e/linopy/constraints.py#L18). When I convert a lazy DataArray to my custom class, the chunked data is computed immediately, which seems a bit weird...

dcherian commented 2 years ago

IMO we should raise an error asking the user to pass da.data instead.
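For reference, passing `da.data` might look like the sketch below. It assumes dask is installed; the dimension names, coordinate values and attributes are made up for illustration. Because only the raw array is handed over, dims, coords and attrs have to be forwarded explicitly, which also removes the ambiguity about which attrs win.

```python
import dask.array as dsa
import numpy as np
import xarray as xr

x = dsa.from_array(np.ones((10, 10)), chunks=5)
da = xr.DataArray(
    x,
    dims=("a", "b"),                # illustrative dimension names
    coords={"a": np.arange(10)},    # illustrative coordinate
    attrs={"units": "m"},           # illustrative attribute
)

# Rebuild from the underlying array; .data returns the dask array
# without computing it, and the metadata is passed along explicitly.
rebuilt = xr.DataArray(da.data, coords=da.coords, dims=da.dims, attrs=da.attrs)

print(type(rebuilt.data).__module__)  # dask.array.core
```

The same pattern should carry over to a DataArray subclass by calling the subclass constructor instead of `xr.DataArray`, as long as the subclass keeps the standard constructor signature.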

FabianHofmann commented 2 years ago

Not sure, but I'd argue for keeping DataArray-from-self construction, as I imagine many convenience cases where the input may be a DataArray, a numpy array or a dask array, and one wants to ensure a DataArray type. Many other packages like pandas/numpy have that. Also, xarray.Dataset supports from-self construction.

Perhaps it is better to raise an error when ambiguities occur? That is, disallow passing attrs or coords when data is a DataArray...
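A minimal sketch of that idea, written as a standalone function rather than a change to xarray itself. `ensure_dataarray` is a hypothetical name and its behavior is an assumption about what such a check could look like, not what xarray does:

```python
import numpy as np
import xarray as xr


def ensure_dataarray(obj, coords=None, dims=None, attrs=None):
    """Return obj as a DataArray (hypothetical helper, not part of xarray).

    If obj already is a DataArray, any extra metadata would be ambiguous,
    so it is rejected instead of silently picking one source.
    """
    if isinstance(obj, xr.DataArray):
        if coords is not None or dims is not None or attrs is not None:
            raise ValueError(
                "coords, dims and attrs cannot be combined with a DataArray "
                "input; pass obj.data instead"
            )
        # Returned unchanged, so lazy (dask) data is never computed.
        return obj
    return xr.DataArray(obj, coords=coords, dims=dims, attrs=attrs)


arr = ensure_dataarray(np.ones((2, 3)), dims=("x", "y"))
same = ensure_dataarray(arr)  # passes through untouched

try:
    ensure_dataarray(arr, attrs={"a": 1})
except ValueError as err:
    print(err)
```

Returning the input unchanged keeps the convenience of "make sure this is a DataArray" while leaving the ambiguous combinations as explicit errors.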