Open · pschlo opened 1 month ago
Looks like this was introduced by `dask==2024.8` and the new `shuffle` algorithm. If run with an earlier version the example gives a single 100,010 chunk for the expanded variables, but now gives size-10 chunking over the whole array. Ideally we want `chunks=((10, 20000, 20000, ...),)`, right?
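To make those chunk layouts concrete, here is a small sketch using plain dask arrays (it only illustrates the chunk tuples being discussed, with sizes taken from the numbers above; it is not what xarray or dask construct internally):

```python
import dask.array as da
import numpy as np

n = 100_010  # total length of the expanded variable in the example

# what earlier dask versions reportedly produced: one big chunk
single = da.full((n,), np.nan, chunks=((n,),))
print(single.chunks)  # ((100010,),)

# what the new shuffle reportedly produces: size-10 chunks over the whole array
tiny = da.full((n,), np.nan, chunks=10)
print(len(tiny.chunks[0]))  # 10001 chunks of size 10

# the layout suggested above: keep the real data's (10,) chunk and use
# large chunks for the NaN padding
ideal = da.full((n,), np.nan, chunks=((10,) + (20_000,) * 5,))
print(ideal.chunks[0][:3])  # (10, 20000, 20000)
```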
This does indeed seem more reasonable. I also tested this with an earlier version but found it to be very slow as well. I guess that shuffling is required for the more general `concat()` case, but in the mentioned case it should be easy to compute, no? I wrote the following function to circumvent this in my project:
```python
import xarray as xr
import dask.array as da
import numpy as np
from collections.abc import Collection


def xr_prefill_concat(datasets: Collection[xr.Dataset], dim: str, *args, **kwargs):
    """Concatenate Dask Datasets by first ensuring they all have the same data_vars."""
    datasets = [ds.copy() for ds in datasets]

    def fill_vars(ds: xr.Dataset, vars: set[str]):
        missing_vars = vars - set(ds.data_vars)
        if not missing_vars:
            return
        # use the chunk size of any existing data variable along `dim`
        dataarr_chunksizes = next(iter(ds.data_vars.values())).chunksizes
        if not dataarr_chunksizes:
            raise ValueError("Dataset must be backed by Dask")
        chunk_size = dataarr_chunksizes[dim][0]
        # add the missing variables as lazy all-NaN arrays with matching chunking
        for var in missing_vars:
            ds[var] = (dim, da.full((ds.sizes[dim],), np.nan, chunks=chunk_size))

    all_vars: set[str] = set().union(*(ds.data_vars for ds in datasets))
    for ds in datasets:
        fill_vars(ds, all_vars)
    return xr.concat(datasets, dim=dim, *args, **kwargs)
```
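For reference, a toy usage of this workaround might look like the following (the names `ds1`, `ds2`, `var1`, `var2` and the sizes are illustrative, chosen to match the numbers discussed above, and `xr_prefill_concat` is the function defined just before):

```python
import numpy as np
import xarray as xr

# two Dask-backed Datasets along "dim1"; the second one is missing "var2"
ds1 = xr.Dataset(
    {"var1": ("dim1", np.arange(10)), "var2": ("dim1", np.arange(10.0))}
).chunk({"dim1": 10})
ds2 = xr.Dataset({"var1": ("dim1", np.arange(100_000))}).chunk({"dim1": 20_000})

combined = xr_prefill_concat([ds1, ds2], dim="dim1")
# the missing part of var2 is a lazy NaN array, chunked like var1 in ds2
print(combined["var2"].chunksizes)  # e.g. {'dim1': (10, 20000, 20000, 20000, 20000, 20000)}
```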
cc @phofl
Thanks for the ping, will take a look
Looking at this again, I'm not sure we can do much to choose the "right" chunksizes here. Each variable is treated independently, so when concatenating `var2` we don't have any context to do anything different.

Also, I can't reproduce this: the example takes 1.5s on my machine.
Would something like a chunk size hint on the dask side help here? (Note: this might not be a viable suggestion, haven’t had the chance to look yet)
What is your issue?
Given the following situation:
- a Dataset along `dim1`, backed by Dask
- another Dataset along `dim1`, backed by Dask

When I `concat()` them along `dim1`, xarray extends the variables that appear in the first Dataset but not in the second Dataset with `NaN`. I would expect this to be lazy and to execute almost instantly, but it turns out to be very slow on my machine.

Example code:
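A minimal sketch of this kind of setup, with assumed variable names and sizes (not the exact snippet from the report), might look like:

```python
import numpy as np
import xarray as xr

# first Dataset has var1 and var2 along dim1, second only has var1
ds1 = xr.Dataset(
    {"var1": ("dim1", np.arange(10)), "var2": ("dim1", np.arange(10.0))}
).chunk({"dim1": 10})
ds2 = xr.Dataset({"var1": ("dim1", np.arange(100_000))}).chunk({"dim1": 20_000})

# var2 is missing from ds2, so xarray pads it with NaN during the concat;
# this is the step reported to be unexpectedly slow
combined = xr.concat([ds1, ds2], dim="dim1")

# inspect the chunking of the NaN-padded variable
print(combined["var2"].chunksizes)
```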
Output:
The last output line is followed by many more `10`s. This takes about 10-20 seconds to run on my machine. Is there any reason for this being so slow? I would've expected the code to execute almost instantly, such that the `NaN` chunks are added lazily, e.g. upon calling `compute()`.

Here is my output of `xr.show_versions()`: