andersy005 opened this issue 4 years ago
The way `compute=False` currently works may be a little confusing. It doesn't actually delay creating files; it just delays writing the array data.
> The way `compute=False` currently works may be a little confusing. It doesn't actually delay creating files; it just delays writing the array data.

Interesting... I always assumed that all operations (including file creation) were delayed. So is this a feature and not a bug, then?
> Interesting... I always assumed that all operations (including file creation) were delayed. So is this a feature and not a bug, then?
Well, there is certainly a case for file creation also being lazy -- it is definitely more intuitive! This was more of an oversight than an intentional omission. Metadata generally needs to be written from a single process anyway, so we never got around to doing it with Dask.
That said, there are also some legitimate use cases where it is nice to be able to eagerly write only the metadata, without any array data. This is what we were proposing to do with `compute=False` in `to_zarr`: https://github.com/pydata/xarray/pull/4035
Here's an alternative `map_blocks` solution:

```python
import xarray as xr

def write_block(ds, t0):
    # each block becomes its own file, named by its offset (in hours) from t0
    if len(ds.time) > 0:
        fname = (ds.time[0] - t0).values.astype("timedelta64[h]").astype(int)
        ds.to_netcdf(f"temp/file-{fname:06d}.nc")
    # dummy return; map_blocks needs something array-like back
    return ds.time

ds = xr.tutorial.open_dataset("air_temperature", chunks={"time": 100})
ds.map_blocks(write_block, kwargs=dict(t0=ds.time[0])).compute(scheduler="processes")
```
There are two workarounds here, though. We can't pass `template=ds.time` because it has no chunk information, and `ds.time.chunk({"time": 100})` silently does nothing because `time` is an `IndexVariable`; so the user function still needs the `len(ds.time) > 0` guard.

I think a cleaner API may be `dask.compute([write_block(block) for block in ds.to_delayed()])`, where `ds.to_delayed()` yields a bunch of tasks, each of which gives a Dataset wrapping one block of the underlying array.
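`ds.to_delayed()` doesn't exist in xarray; a minimal sketch of the idea, using a hypothetical helper (`dataset_to_delayed` and the fixed-size slicing are my own stand-ins, not xarray API):

```python
import dask
import numpy as np
import xarray as xr

def dataset_to_delayed(ds, dim, size):
    # hypothetical stand-in for the proposed ds.to_delayed():
    # yield one delayed Dataset per size-sized block along `dim`
    for start in range(0, ds.sizes[dim], size):
        yield dask.delayed(ds.isel)({dim: slice(start, start + size)})

def write_block(block):
    # stand-in for a real writer like block.to_netcdf(...);
    # here we just report how big each block is
    return int(block.sizes["time"])

ds = xr.Dataset({"a": ("time", np.arange(10.0))})
tasks = [dask.delayed(write_block)(block) for block in dataset_to_delayed(ds, "time", 4)]
print(dask.compute(tasks)[0])
```

Each element of `tasks` is a fully independent write, so no dummy return value or empty-block guard is needed.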
**What happened**:
While using the `xr.save_mfdataset()` function with `compute=False`, I noticed that the function returns a `dask.delayed` object but doesn't actually defer the computation, i.e. it writes the datasets right away.

**What you expected to happen**:
I expect the datasets to be written only when I explicitly call `.compute()` on the returned delayed object.

**Minimal Complete Verifiable Example**:
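The original MCVE was not captured in this copy of the issue; the following stand-in (the toy dataset, grouping, and file names are mine) exercises the reported behaviour:

```python
import os
import tempfile

import numpy as np
import pandas as pd
import xarray as xr

tmpdir = tempfile.mkdtemp()
times = pd.date_range("2000-12-30", periods=4)
ds = xr.Dataset(
    {"a": ("time", np.arange(4.0))}, coords={"time": times}
).chunk({"time": 2})

# split into one dataset per year and pick a path for each
years, datasets = zip(*ds.groupby("time.year"))
paths = [os.path.join(tmpdir, f"{y}.nc") for y in years]

delayed = xr.save_mfdataset(datasets, paths, compute=False)
# the files already exist on disk at this point (metadata is written
# eagerly), which is the surprising behaviour reported above
delayed.compute()  # the array data is written here
```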
**Anything else we need to know?**:
**Environment**:

Output of `xr.show_versions()`

```python
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.21.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.4

xarray: 0.15.1
pandas: 0.25.3
numpy: 1.18.5
scipy: 1.5.0
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.2.0
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.20.0
distributed: 2.20.0
matplotlib: 3.2.1
cartopy: None
seaborn: None
numbagg: None
setuptools: 49.1.0.post20200704
pip: 20.1.1
conda: None
pytest: None
IPython: 7.16.1
sphinx: None
```