pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

{DataArray,Dataset}.rank() should support an optional list of dimensions #3810

Open seth-p opened 4 years ago

seth-p commented 4 years ago

{DataArray,Dataset}.rank() requires a single dim. Why not support an optional list of dimensions (defaulting to all)?

In [1]: import numpy as np, xarray as xr                                                                                                                                                                                                           

In [2]: d = xr.DataArray(np.arange(12).reshape((4,3)), dims=('abc', 'xyz'))                                                                                                                                                                        

In [3]: d                                                                                                                                                                                                                                          
Out[3]: 
<xarray.DataArray (abc: 4, xyz: 3)>
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
Dimensions without coordinates: abc, xyz

In [4]: d.rank()                                                                                                                                                                                                                                   
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-585571c1eca8> in <module>
----> 1 d.rank()

TypeError: rank() missing 1 required positional argument: 'dim'

In [5]: d.rank(dim=('xyz', 'abc'))                                                                                                                                                                                                                 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-006c73551ff8> in <module>
----> 1 d.rank(dim=('xyz', 'abc'))

~/.conda/envs/build/lib/python3.7/site-packages/xarray/core/dataarray.py in rank(self, dim, pct, keep_attrs)
   3054         """
   3055 
-> 3056         ds = self._to_temp_dataset().rank(dim, pct=pct, keep_attrs=keep_attrs)
   3057         return self._from_temp_dataset(ds)
   3058 

~/.conda/envs/build/lib/python3.7/site-packages/xarray/core/dataset.py in rank(self, dim, pct, keep_attrs)
   5295         """
   5296         if dim not in self.dims:
-> 5297             raise ValueError("Dataset does not contain the dimension: %s" % dim)
   5298 
   5299         variables = {}

TypeError: not all arguments converted during string formatting

In [6]: xr.show_versions()                                                                                                                                                                                                                         

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48) 
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.3

xarray: 0.15.0
pandas: 1.0.1
numpy: 1.18.1
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.11.0
distributed: 2.11.0
matplotlib: 3.1.3
cartopy: None
seaborn: 0.10.0
numbagg: installed
setuptools: 45.2.0.post20200209
pip: 20.0.2
conda: 4.8.2
pytest: None
IPython: 7.12.0
sphinx: None

max-sixty commented 4 years ago

This would be great. The underlying numerical library we use, bottleneck, doesn't support multiple dimensions. If there were another option, or someone wanted to write one in numbagg, that would be a welcome addition.

seth-p commented 4 years ago

Assuming dims is a non-empty list of dimensions, the following code seems to work:

    temp_dim = '__temp_dim__'
    return (
        da.stack(**{temp_dim: dims})
        .rank(temp_dim, pct=pct, keep_attrs=keep_attrs)
        .unstack(temp_dim)
        .transpose(*da.dims)
        .drop_vars([dim_ for dim_ in dims if dim_ not in da.coords])
    )

max-sixty commented 4 years ago

Yes, we can always reshape as a way of running numerical operations over multiple dimensions. But reshaping can be an expensive operation, so doing it as part of a numerical operation can cause surprises. (If you're interested, try running a sum over multiple dimensions and compare it with a reshape followed by a sum over the single reshaped dimension.)

Instead, users can do this themselves, giving them context and control.

Reshaping is OK to do in groupby though (I think), so adding rank to groupby would be one way of accomplishing this.
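
The comparison suggested above (a sum over multiple dimensions versus a reshape followed by a sum over the single reshaped dimension) can be sketched with plain NumPy. This is only an illustration under assumed array sizes, not a benchmark of xarray itself; the function names sum_direct and sum_via_reshape are made up for this example.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
a = rng.random((100, 200, 40))  # think of dims (x, y, z); sizes are arbitrary

def sum_direct(arr):
    # Reduce over the first and last axes in a single call.
    return arr.sum(axis=(0, 2))

def sum_via_reshape(arr):
    # Bring the two reduced axes next to each other, merge them into one
    # axis, then reduce over that single axis. Merging axes that are not
    # adjacent in memory cannot be done as a view, so reshape copies here.
    moved = np.moveaxis(arr, 0, 1)              # (200, 100, 40), still a view
    merged = moved.reshape(moved.shape[0], -1)  # (200, 4000), a copy
    return merged.sum(axis=1)

same_result = np.allclose(sum_direct(a), sum_via_reshape(a))
reshape_copies = not np.shares_memory(
    a, np.moveaxis(a, 0, 1).reshape(a.shape[1], -1)
)

t_direct = timeit.timeit(lambda: sum_direct(a), number=5)
t_reshape = timeit.timeit(lambda: sum_via_reshape(a), number=5)
print(f"direct: {t_direct:.4f}s  reshape+sum: {t_reshape:.4f}s")
```

Both paths give the same values; the difference is the extra copy made by the reshape. Timings depend on array size and memory layout, so no numbers are claimed here.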

seth-p commented 4 years ago

What's wrong with the following? (Still need to deal with pct and keep_attrs.)

apply_ufunc(
    bottleneck.{nan}rankdata,
    self,
    input_core_dims=[dims],
    output_core_dims=[dims],
    vectorize=True
)

Per https://kwgoodman.github.io/bottleneck-doc/reference.html#bottleneck.rankdata, "The default (axis=None) is to rank the elements of the flattened array."

max-sixty commented 4 years ago

Could you try running that?

seth-p commented 4 years ago

A few minor tweaks needed:

In [20]: import bottleneck        

In [21]: xr.apply_ufunc( 
    ...:     lambda x: bottleneck.rankdata(x).reshape(x.shape), 
    ...:     d, 
    ...:     input_core_dims=[['xyz', 'abc']], 
    ...:     output_core_dims=[['xyz', 'abc']], 
    ...:     vectorize=True 
    ...: ).transpose(*d.dims)                                                                                                                                                                                                                      
Out[21]: 
<xarray.DataArray (abc: 4, xyz: 3)>
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.],
       [ 7.,  8.,  9.],
       [10., 11., 12.]])
Dimensions without coordinates: abc, xyz

Despite what the docs say, bottleneck.{nan}rankdata(a) returns a 1-dimensional ndarray, not an array with the same shape as a.
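
A minimal sketch of how the call above might be wrapped so that it also handles arrays with extra dimensions that are not being ranked. Two assumptions: scipy.stats.rankdata is swapped in for bottleneck.rankdata (with no axis it also ranks the flattened input and returns a 1-dimensional array), and the helper name rank_over_dims is hypothetical, not part of xarray.

```python
import numpy as np
import xarray as xr
from scipy.stats import rankdata

def rank_over_dims(da, dims):
    """Hypothetical helper: rank jointly over several dimensions."""
    ranked = xr.apply_ufunc(
        # rankdata flattens its input, so restore the block's shape.
        lambda x: rankdata(x).reshape(x.shape),
        da,
        input_core_dims=[list(dims)],
        output_core_dims=[list(dims)],
        vectorize=True,  # loop over the remaining (non-core) dimensions
    )
    # apply_ufunc moves core dims to the end; restore the original order.
    return ranked.transpose(*da.dims)

data = xr.DataArray(
    np.array([[[50, 10, 30], [20, 60, 40]],
              [[1, 1, 2], [3, 3, 3]]]),
    dims=("t", "abc", "xyz"),
)
ranked = rank_over_dims(data, ("xyz", "abc"))
```

Each slice along t is ranked independently over abc and xyz together; ties get the average rank because that is the default method of scipy.stats.rankdata.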

max-sixty commented 4 years ago

Great -- that's cool and a good use of apply_ufunc. As above, we wouldn't want to replace rank with that, given the reshaping (we'd need a function that computes over multiple dimensions).

We could use something similar for groupbys though?

seth-p commented 4 years ago

Note that with the apply_ufunc implementation we're only reshaping dims-sized ndarrays, not (necessarily) the whole DataArray, so maybe it's not too bad? It might be better to first sort dims to be in the same order as self.dims, i.e. dims = [dim_ for dim_ in self.dims if dim_ in dims]. But I'm just speculating.
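
The reordering idea can be illustrated at the NumPy level. This is a sketch under the assumption that the ranked block is a plain C-contiguous array; whether apply_ufunc actually benefits in practice is not verified here. The variable names are made up for the example.

```python
import numpy as np

array_dims = ("abc", "xyz")   # the array's own dimension order
requested = ("xyz", "abc")    # the order the caller passed in

# Reorder the requested dims to follow the array's own order.
ordered = [dim_ for dim_ in array_dims if dim_ in requested]

a = np.arange(12).reshape((4, 3))  # axes correspond to ("abc", "xyz")

# Flattening in the array's own order can reuse the existing buffer ...
flat_native = a.reshape(-1)
# ... while the ("xyz", "abc") order needs a transpose first, and reshaping
# that transposed view forces a copy.
flat_requested = a.T.reshape(-1)

native_is_view = np.shares_memory(a, flat_native)
requested_is_view = np.shares_memory(a, flat_requested)
```

Here native_is_view is True and requested_is_view is False, which is the copy that sorting dims would avoid for this layout.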

max-sixty commented 4 years ago

Yeah, unfortunately I'm fairly confident about this; have a go with moderately large arrays for sum and you'll quickly see the performance cliff.

josephnowak commented 2 years ago

Is it possible to add an option to control how ties are handled in the rank? (If you want, I can create a separate issue for this.)

I think this can be done using the scipy rankdata function instead of the bottleneck one (although adding a method option to the bottleneck package should also be possible).
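
For reference, a short sketch of how the method argument of scipy.stats.rankdata changes the treatment of ties. The input values are made up for illustration.

```python
from scipy.stats import rankdata

values = [10, 20, 20, 30]  # the two 20s are tied

ranks_by_method = {
    method: rankdata(values, method=method).tolist()
    for method in ("average", "min", "max", "dense", "ordinal")
}

for method, ranks in ranks_by_method.items():
    print(f"{method:>8}: {ranks}")
```

The tied values receive rank 2.5 with average, 2 with min, 3 with max, 2 with dense (and the next value gets 3 rather than 4), and 2 and 3 in order of appearance with ordinal.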

Small example:


import dask.array
import numpy as np
import xarray
from scipy.stats import rankdata

arr = xarray.DataArray(
    dask.array.random.random((11, 10), chunks=(3, 2)),
    coords={'a': list(range(11)), 'b': list(range(10))},
    dims=('a', 'b')
)

def rank(x: xarray.DataArray, dim: str, method: str):
    # This option generates fewer tasks; I don't know why.
    axis = x.dims.index(dim)
    return xarray.DataArray(
        dask.array.apply_along_axis(
            rankdata,
            axis,
            x.data,
            dtype=float,
            shape=(x.sizes[dim], ),
            method=method
        ),
        coords=x.coords,
        dims=x.dims
    )

def rank2(x: xarray.DataArray, dim: str, method: str):
    axis = x.dims.index(dim)
    return xarray.apply_ufunc(
        rankdata,
        # The ranked dimension has to sit in a single chunk, because the
        # function is applied to each block independently.
        x.chunk({dim: x.sizes[dim]}),
        dask='parallelized',
        kwargs={'method': method, 'axis': axis},
        meta=x.data._meta
    )

arr_rank1 = rank(arr, 'a', 'ordinal')
arr_rank2 = rank2(arr, 'a', 'ordinal')

assert arr_rank1.equals(arr_rank2)

# This can probably work for ranking arrays with NaN values
def _nanrankdata1(a, method):
    y = np.full(a.shape, np.nan, dtype=np.float64)
    idx = ~np.isnan(a)
    y[idx] = rankdata(a[idx], method=method)
    return y
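
A small usage sketch for the NaN-aware idea in _nanrankdata1 above. It assumes 1-dimensional input (boolean indexing flattens anything larger) and uses np.apply_along_axis to extend it to one axis of a 2-D array; the helper name nan_rank_1d is hypothetical.

```python
import numpy as np
from scipy.stats import rankdata

def nan_rank_1d(a, method="average"):
    """Hypothetical 1-D helper: rank the non-NaN values, keep NaN in place."""
    a = np.asarray(a, dtype=np.float64)
    out = np.full(a.shape, np.nan)
    valid = ~np.isnan(a)
    out[valid] = rankdata(a[valid], method=method)
    return out

x = np.array([3.0, np.nan, 1.0, 3.0])
ranks = nan_rank_1d(x)

# Along axis 0 of a 2-D array, apply the 1-D helper to each column.
grid = np.array([[3.0, np.nan],
                 [1.0, 2.0],
                 [np.nan, 1.0]])
ranks_axis0 = np.apply_along_axis(nan_rank_1d, 0, grid)
```

NaN positions stay NaN and do not take up a rank, so the largest rank in each slice equals the number of valid values in it.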