By switching to `numpy_groupies` we are vectorizing our groupby reductions. I think we can do the same for groupby's binary ops.

Here's an example array
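Something along these lines (the original construction isn't shown above, so the shape, sizes, and label values here are illustrative):

```python
import numpy as np
import xarray as xr

# Illustrative setup: a 2D array with a "labels" coordinate along "x"
# defining the groups. Sizes and label values are made up.
N = 8000
da = xr.DataArray(
    np.random.random((N, N)),
    dims=("x", "y"),
    coords={"labels": ("x", np.repeat(["a", "b", "c", "d", "e", "f", "g", "h"], N // 8))},
)
```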
Consider this "anomaly" calculation, anomaly defined relative to the group mean
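Roughly the usual groupby-arithmetic pattern (the original `anom_current` isn't shown above, so this is an assumed reconstruction):

```python
# Assumed reconstruction of the current approach (not the original code):
# xarray's groupby arithmetic applies the binary op one group at a time.
def anom_current(da):
    grouped = da.groupby("labels")
    mean = grouped.mean()
    anom = grouped - mean
    return anom
```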
With this approach, we loop over each group and apply the binary operation: https://github.com/pydata/xarray/blob/a1635d324753588e353e4e747f6058936fa8cf1e/xarray/core/computation.py#L502-L525

This saves some memory, but becomes slow for a large number of groups.

We could instead do

```python
def anom_vectorized(da):
    mean = da.groupby("labels").mean()
    mean_expanded = mean.sel(labels=da.labels)
    anom = da - mean_expanded
    return anom
```
Now we are faster, but construct an extra array as big as the original array (I think this is an OK tradeoff).
```python
%timeit anom_current(da)
# 1.4 s ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit anom_vectorized(da)
# 937 ms ± 5.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
(I haven't experimented with dask yet, so the following is just a theory).
I think the real benefit comes with dask. Depending on where the groups are located relative to chunking, we could end up creating a lot of tiny chunks by splitting up existing chunks. With the vectorized approach we can do better.
Ideally we would reindex the "mean" dask array with a numpy-array-of-repeated-ints such that the chunking of mean_expanded exactly matches the chunking of da along the grouped dimension.
~In practice, dask.array.take doesn't allow specifying "output chunks" so we'd end up chunking "mean_expanded" based on dask's automatic heuristics, and then rechunking again for the binary operation.~
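A rough sketch of that reindexing idea, reusing the example setup from above (chunk sizes are illustrative, and this goes through xarray's normal vectorized indexing rather than specifying output chunks directly):

```python
import numpy as np
import xarray as xr

# Assume the example `da` from above, now dask-backed and chunked along "x".
dask_da = da.chunk({"x": 1000})

# Per-group mean; with a dask-backed input this is also a (small) dask array.
mean = dask_da.groupby("labels").mean()

# Integer code of each element's group along "x". np.unique sorts its uniques,
# which matches the sorted group order that groupby uses.
_, codes = np.unique(dask_da.labels.values, return_inverse=True)

# One vectorized take expands the per-group means back out along "x".
# Ideally the result would be chunked to match dask_da's chunks along "x",
# so that the final subtraction needs no rechunking.
mean_expanded = mean.isel(labels=xr.DataArray(codes, dims="x"))
anom = dask_da - mean_expanded
```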
Thoughts?
cc @rabernat