pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Can't call open_mfdataset without creating chunked dask arrays #9038

Open TomNicholas opened 3 months ago

TomNicholas commented 3 months ago

What happened?

Passing chunks=None to xr.open_dataset/open_mfdataset is supposed to avoid using dask at all, returning lazily-indexed numpy arrays even if dask is installed. However, chunks=None doesn't currently work for xr.open_mfdataset, because it is silently coerced internally to chunks={}, which creates dask chunks aligned with the on-disk files.

Offending line of code: https://github.com/pydata/xarray/blob/12123be8608f3a90c6a6d8a1cdc4845126492364/xarray/backends/api.py#L1040
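
To illustrate why chunks=None cannot survive a coercion of this kind, here is a minimal sketch. It assumes the coercion has the form `chunks or {}`; the helper name `coerce_chunks` is hypothetical and is not part of xarray's API.

```python
def coerce_chunks(chunks):
    """Hypothetical stand-in for the coercion described above (not xarray's code)."""
    # `None` is falsy, so `or` replaces it with an empty dict.
    # An empty dict is falsy too, but it is replaced by an equal empty dict.
    return chunks or {}


# Compare what the caller passed with what the opener would receive.
for value in (None, {}, {"time": 100}, "auto"):
    print(repr(value), "->", repr(coerce_chunks(value)))
```

Under that assumption, None and {} become indistinguishable after the coercion, so the caller has no way to request "no dask". Telling them apart would need an explicit `is None` check instead of a truthiness test.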

What did you expect to happen?

Passing chunks=None to open_mfdataset should return lazily-indexed numpy arrays, like open_dataset does.

Minimal Complete Verifiable Example

import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")

ds1 = ds.isel(time=slice(None, 1000))
ds2 = ds.isel(time=slice(1000, None))

ds1.to_netcdf('air1.nc')
ds2.to_netcdf('air2.nc')

combined = xr.open_mfdataset(['air1.nc', 'air2.nc'], chunks=None)

print(type(combined['air'].data))

Relevant log output

dask.array.core.Array

Anything else we need to know?

As the default is chunks=None, fixing this without also changing the default would be a breaking change, since every call that relies on the default would stop returning dask arrays. But the current behaviour is also not intended.

Environment

main

dcherian commented 3 months ago

> Passing chunks=None to open_mfdataset should return lazily-indexed numpy arrays, like open_dataset does.

Can't do this without virtual concat machinery (https://github.com/pydata/xarray/issues/4628) which someone decided to implement elsewhere 🙄 ;)

We could change the default to chunks={} in anticipation though.

TomNicholas commented 3 months ago

> Can't do this without virtual concat machinery (https://github.com/pydata/xarray/issues/4628) which someone decided to implement elsewhere 🙄 ;)

😅

It's still broken at the moment though. I had a (ridiculous) case where I didn't care that concat would load everything into memory; I just wanted to completely avoid creating dask.array objects, and right now there is no possible input option to open_mfdataset to do that.

> We could change the default to chunks={} in anticipation though.

That's probably more useful, as well as actually being consistent.
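
For the use case described above (avoiding dask.array objects entirely and accepting that concatenation loads everything into memory), one possible workaround is to skip open_mfdataset and combine the files by hand. The following is only a sketch: the helper name `open_files_without_dask` is hypothetical, and it assumes the files can simply be concatenated along a single known dimension.

```python
import xarray as xr


def open_files_without_dask(paths, concat_dim):
    """Hypothetical helper: combine files eagerly, never creating dask arrays."""
    # chunks=None asks open_dataset for lazily-indexed numpy-backed variables.
    datasets = [xr.open_dataset(path, chunks=None) for path in paths]
    try:
        # Concatenating non-dask variables reads their values into memory.
        combined = xr.concat(datasets, dim=concat_dim)
        # Make sure nothing still refers to the files before they are closed.
        return combined.load()
    finally:
        for ds in datasets:
            ds.close()


# Usage with the files from the example at the top of this issue:
# combined = open_files_without_dask(["air1.nc", "air2.nc"], concat_dim="time")
# print(type(combined["air"].data))
```

This gives up everything open_mfdataset offers beyond a plain concatenation (combine="by_coords", preprocess, parallel opening), so it only fits cases where the combined data comfortably fits in memory.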

Illviljan commented 3 months ago

See #5704 for changing the default to chunks={} and more discussion.