seth-p opened this issue 4 years ago
This would be great. The underlying numerical library we use, bottleneck, doesn't support multiple dimensions. If there were another option, or someone wanted to write one in numbagg, that would be a welcome addition.
Assuming `dims` is a non-empty list of dimensions, the following code seems to work:
```python
temp_dim = '__temp_dim__'
return (
    da.stack(**{temp_dim: dims})
    .rank(temp_dim, pct=pct, keep_attrs=keep_attrs)
    .unstack(temp_dim)
    .transpose(*da.dims)
    .drop_vars([dim_ for dim_ in dims if dim_ not in da.coords])
)
```
Yes, we can always reshape as a way of running numerical operations over multiple dimensions. But reshaping can be an expensive operation, so doing it implicitly as part of a numerical operation can cause surprises. (If you're interested, try running a sum over multiple dimensions and comparing it to a reshape plus a sum over the single reshaped dimension.) Instead, users can do the reshape themselves, which gives them context and control.
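A minimal NumPy sketch of where that hidden cost comes from: reshaping along the existing memory layout is free, but reshaping across a non-contiguous layout (which is what stacking arbitrary dimensions can require) forces a copy of the data.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# Reshaping a C-contiguous array returns a view -- no copy:
flat_c = a.reshape(-1)
print(np.shares_memory(a, flat_c))   # True

# Reshaping a transposed (non-contiguous) array must copy the data --
# this is the cost a stack-then-rank implementation can hide:
flat_t = a.T.reshape(-1)
print(np.shares_memory(a, flat_t))   # False
```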
Reshaping is OK to do in `groupby` though (I think), so adding `rank` to groupby would be one way of accomplishing this.
What's wrong with the following? (Still need to deal with `pct` and `keep_attrs`.)
```python
apply_ufunc(
    bottleneck.{nan}rankdata,  # i.e. rankdata or nanrankdata
    self,
    input_core_dims=[dims],
    output_core_dims=[dims],
    vectorize=True,
)
```
Per https://kwgoodman.github.io/bottleneck-doc/reference.html#bottleneck.rankdata, "The default (axis=None) is to rank the elements of the flattened array."
Could you try running that?
A few minor tweaks needed:
```python
In [20]: import bottleneck

In [21]: xr.apply_ufunc(
    ...:     lambda x: bottleneck.rankdata(x).reshape(x.shape),
    ...:     d,
    ...:     input_core_dims=[['xyz', 'abc']],
    ...:     output_core_dims=[['xyz', 'abc']],
    ...:     vectorize=True
    ...: ).transpose(*d.dims)
Out[21]:
<xarray.DataArray (abc: 4, xyz: 3)>
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.],
       [ 7.,  8.,  9.],
       [10., 11., 12.]])
Dimensions without coordinates: abc, xyz
```
Despite what the docs say, `bottleneck.{nan}rankdata(a)` returns a 1-dimensional ndarray, not an array with the same shape as `a`.
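For what it's worth, scipy's `rankdata` flattens by default in the same way, but it gained an `axis` argument (in scipy 1.4) that preserves the input shape:

```python
import numpy as np
from scipy.stats import rankdata

a = np.array([[30.0, 10.0],
              [20.0, 40.0]])

# By default scipy flattens, just like bottleneck:
print(rankdata(a).shape)     # (4,)

# With axis= the shape is preserved; here, ranks within each row:
print(rankdata(a, axis=1))   # [[2. 1.]
                             #  [1. 2.]]
```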
Great -- that's cool and a good use of `apply_ufunc`. As above, we wouldn't want to replace `rank` with that given the reshaping (we'd need a function that computes over multiple dimensions). We could use something similar for groupbys though?
Note that with the `apply_ufunc` implementation we're only reshaping `dims`-sized `ndarray`s, not (necessarily) the whole DataArray, so maybe it's not too bad? It might be better to first sort `dims` to be in the same order as `self.dims`, i.e. `dims = [dim_ for dim_ in self.dims if dim_ in dims]`. But I'm just speculating.
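The suggested reordering can be sketched in plain Python (`self_dims` below is a hypothetical stand-in for `self.dims`):

```python
# Hypothetical stand-ins for the array's dims and the user-supplied dims:
self_dims = ('time', 'lat', 'lon')
dims = ['lon', 'time']

# Keep only the requested dims, in the order they appear on the array:
dims = [dim_ for dim_ in self_dims if dim_ in dims]
print(dims)  # ['time', 'lon']
```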
Yeah, unfortunately I'm fairly confident about this; have a go with moderately large arrays for `sum` and you'll quickly see the performance cliff.
Is it possible to add an option for controlling what happens when there is a tie in the ranks? (If you want, I can create a separate issue for this.) I think this can be done using the scipy `rankdata` function instead of the bottleneck one (though adding a `method` option to the bottleneck package should also be possible). Small example:
```python
import dask.array
import xarray
from scipy.stats import rankdata

arr = xarray.DataArray(
    dask.array.random.random((11, 10), chunks=(3, 2)),
    coords={'a': list(range(11)), 'b': list(range(10))}
)

def rank(x: xarray.DataArray, dim: str, method: str):
    # This option generates fewer tasks, I don't know why
    axis = x.dims.index(dim)
    return xarray.DataArray(
        dask.array.apply_along_axis(
            rankdata,
            axis,
            x.data,
            dtype=float,
            shape=(x.sizes[dim],),
            method=method,
        ),
        coords=x.coords,
        dims=x.dims,
    )
```
```python
def rank2(x: xarray.DataArray, dim: str, method: str):
    from scipy.stats import rankdata
    axis = x.dims.index(dim)
    return xarray.apply_ufunc(
        rankdata,
        x.chunk({dim: x.sizes[dim]}),
        dask='parallelized',
        kwargs={'method': method, 'axis': axis},
        meta=x.data._meta,
    )
```
```python
arr_rank1 = rank(arr, 'a', 'ordinal')
arr_rank2 = rank2(arr, 'a', 'ordinal')

assert arr_rank1.equals(arr_rank2)
```
```python
# Probably this can work for ranking arrays with nan values
def _nanrankdata1(a, method):
    y = np.empty(a.shape, dtype=np.float64)
    y.fill(np.nan)
    idx = ~np.isnan(a)
    y[idx] = rankdata(a[idx], method=method)
    return y
```
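A quick sanity check of that helper (repeated here with its imports so the snippet runs standalone): NaNs stay NaN and only the remaining values are ranked.

```python
import numpy as np
from scipy.stats import rankdata

def _nanrankdata1(a, method):
    y = np.empty(a.shape, dtype=np.float64)
    y.fill(np.nan)
    idx = ~np.isnan(a)
    y[idx] = rankdata(a[idx], method=method)
    return y

a = np.array([3.0, np.nan, 1.0, 2.0])
print(_nanrankdata1(a, 'ordinal'))  # ranks: 3., nan, 1., 2.
```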
`{DataArray,Dataset}.rank()` requires a single `dim`. Why not support an optional list of dimensions (defaulting to all)?
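For reference, a pure NumPy/scipy sketch of what ranking over a list of axes could look like (the function name and signature are made up, not an xarray API): move the target axes to the end, collapse them into one, rank along it with scipy's `rankdata`, and restore the original layout.

```python
import numpy as np
from scipy.stats import rankdata

def rank_along_axes(a, axes, method='average'):
    # Normalize axes and split into the axes we keep and the axes we rank over.
    axes = tuple(ax % a.ndim for ax in axes)
    keep = tuple(ax for ax in range(a.ndim) if ax not in axes)
    # Move the target axes to the end, then collapse them into one axis.
    moved = np.transpose(a, keep + axes)
    flat = moved.reshape(moved.shape[:len(keep)] + (-1,))
    # Rank along the collapsed axis (scipy >= 1.4 supports axis=).
    ranked = rankdata(flat, method=method, axis=-1)
    # Restore the original shape and axis order.
    return np.transpose(ranked.reshape(moved.shape), np.argsort(keep + axes))

a = np.arange(12).reshape(3, 4)
print(rank_along_axes(a, (0, 1)))  # ranks 1..12, shape (3, 4)
```

As noted earlier in the thread, the hidden reshape makes this unsuitable as-is for a general `rank`, but it shows the semantics a multi-dim option would need.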