pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.57k stars 1.07k forks source link

`xr.save_mfdataset()` doesn't honor `compute=False` argument #4209

Open andersy005 opened 4 years ago

andersy005 commented 4 years ago

What happened:

While using xr.save_mfdataset() function with compute=False I noticed that the function returns a dask.delayed object, but it doesn't actually defer the computation i.e. it actually writes datasets right away.

What you expected to happen:

I expect the datasets to be written when I explicitly call .compute() on the returned delayed object.

Minimal Complete Verifiable Example:

In [2]: import xarray as xr

In [3]: ds = xr.tutorial.open_dataset('rasm', chunks={})

In [4]: ds
Out[4]:
<xarray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
    xc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
    yc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 dask.array<chunksize=(36, 205, 275), meta=np.ndarray>
Attributes:
    title:                     /workspace/jhamman/processed/R1002RBRxaaa01a/l...
    institution:               U.W.
    source:                    RACM R1002RBRxaaa01a
    output_frequency:          daily
    output_mode:               averaged
    convention:                CF-1.4
    references:                Based on the initial model of Liang et al., 19...
    comment:                   Output from the Variable Infiltration Capacity...
    nco_openmp_thread_number:  1
    NCO:                       "4.6.0"
    history:                   Tue Dec 27 14:15:22 2016: ncatted -a dimension...

In [5]: path = "test.nc"

In [7]: ls -ltrh test.nc
ls: cannot access test.nc: No such file or directory

In [8]: tasks = xr.save_mfdataset(datasets=[ds], paths=[path], compute=False)

In [9]: tasks
Out[9]: Delayed('list-aa0b52e0-e909-4e65-849f-74526d137542')

In [10]: ls -ltrh test.nc
-rw-r--r-- 1 abanihi ncar 14K Jul  8 10:29 test.nc

Anything else we need to know?:

Environment:

Output of xr.show_versions() ```python INSTALLED VERSIONS ------------------ commit: None python: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0] python-bits: 64 OS: Linux OS-release: 3.10.0-693.21.1.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.5 libnetcdf: 4.7.4 xarray: 0.15.1 pandas: 0.25.3 numpy: 1.18.5 scipy: 1.5.0 netCDF4: 1.5.3 pydap: None h5netcdf: None h5py: 2.10.0 Nio: None zarr: None cftime: 1.2.0 nc_time_axis: 1.2.0 PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2.20.0 distributed: 2.20.0 matplotlib: 3.2.1 cartopy: None seaborn: None numbagg: None setuptools: 49.1.0.post20200704 pip: 20.1.1 conda: None pytest: None IPython: 7.16.1 sphinx: None ```
shoyer commented 4 years ago

The way compute=False currently works may be a little confusing. It doesn't actually delay creating files, it just delays writing the array data.

andersy005 commented 4 years ago

The way compute=False currently works may be a little confusing. It doesn't actually delay creating files, it just delays writing the array data.

Interesting... I always assumed that all operations (including file creation) were delayed. So, this is a feature and not a bug then?

shoyer commented 4 years ago

The way compute=False currently works may be a little confusing. It doesn't actually delay creating files, it just delays writing the array data.

Interesting... I always assumed that all operations (including file creation) were delayed. So, this is a feature and not a bug then?

Well, there is certainly a case for file creation also being lazy -- it definitely is more intuitive! This was more of an oversight than an intentional omission. Metadata generally needs to be written from a single process, anyways, so we never got around to doing it with Dask.

That said, there are also some legitimate use cases where it is nice to be able to eagerly write metadata only without any array data. This is what we were proposing to do with compute=False in to_zarr: https://github.com/pydata/xarray/pull/4035

dcherian commented 4 years ago

Here's an alternative map_blocks solution:

def write_block(ds, t0):
    if len(ds.time) > 0:
        fname = (ds.time[0] - t0).values.astype("timedelta64[h]").astype(int)
        ds.to_netcdf(f"temp/file-{fname:06d}.nc")

    # dummy return
    return ds.time

ds = xr.tutorial.open_dataset("air_temperature", chunks={"time": 100})
ds.map_blocks(write_block, kwargs=dict(t0=ds.time[0])).compute(scheduler="processes")

There are two workarounds here though.

  1. The user function always has to return something.
  2. We can't provide template=ds.time because it has no chunk information and ds.time.chunk({"time": 100}) silently does nothing because it is an IndexVariable. So the user function still needs the len(ds.time) > 0 workaround.

I think a cleaner API may be to have dask.compute([write_block(block) for block in ds.to_delayed()]) where ds.to_delayed() yields a bunch of tasks; each of which gives a Dataset wrapping one block of the underlying array.