pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.59k stars 1.08k forks source link

DataArray.argsort should be deleted #1635

Open seth-p opened 6 years ago

seth-p commented 6 years ago

Originally posted to https://groups.google.com/forum/#!topic/xarray/wsxeiIPLhgM

DataArray.argsort() appears to simply wrap the result of DataArray.values.argsort() in a same-shape DataArray. This is semantically nonsensical. If anything the index on the resulting argsort() values should simply be range(len(da)), but that adds little to the underlying numpy structure. And there's not much reason to have da.argsort() simply return the (raw) result of da.values.argsort(). So really DataArray.argsort() should simply be deleted.

On the other hand a new function DataArray.rank() that wraps da.values.argsort().argsort() (note repeated call to ndarray.argsort()) in the structure of the original DataArray would make sense, and perhaps even be useful... (Note that I'm not claiming that .argsort().argsort() is the fastest way to calculate this, but it's probably good enough, at least for an initial implementation.)

shoyer commented 6 years ago

It seems like another reasonable choice would be for DataArray.argsort() to keep its current value but drop coordinate labels along the sorting dimension. This would give a consistent result for use with broadcasting indexing (https://github.com/pydata/xarray/pull/1473).

I think this internal utility function is equivalent to your second argsort: https://github.com/pydata/xarray/blob/2949558b75a65404a500a237ec54834fd6946d07/xarray/core/nputils.py#L38-L55

In practice, we might use bottleneck.rankdata() or nanrankdata(): https://kwgoodman.github.io/bottleneck-doc/reference.html#non-reduce-with-axis

seth-p commented 6 years ago

I think that makes sense, though I don't quite understand what would go in its place. Another possibility -- perhaps a bad one -- is to permute the values in the sorted dimension so that they match the resulting values (i.e. something like result.coords[dim] = np.take(da.coords[dim].values, result.values, axis=axis)).

Note that ndarray.argsort(axis=None) sorts the flattened array, so the returned DataArray should respect this

Alternative suggestion: have DataArray.argsort() return an ndarray filled with labels from the sorted dimension, i.e. something like:

class DataArray:
    def argsort(self, **kwargs):
        # TODO: update kwargs['axis'] based 'axis' and 'dim', and remove 'dim'
        if kwargs['axis'] is None:
            kwargs['axis'] = -1
            return self.stack(dim=self.dims).argsort(**kwargs)
        return np.take(self.coords[self.dims[kwargs['axis']].values, self.values.argsort(**kwargs))

BTW, I'm just thinking in terms of ndarrays. Someone more knowledgeable than me may want to consider how to make it work intelligently with dask.

shoyer commented 6 years ago

Note that ndarray.argsort(axis=None) sorts the flattened array, so the returned DataArray should respect this

I'm not a huge fan of auto-flattening for xarray, but I can see this logic.

Alternative suggestion: have DataArray.argsort() return an ndarray filled with labels from the sorted dimension

I would probably implement a new method for this, maybe idxsort for symmetry with idxmax (https://github.com/pydata/xarray/issues/60). Though that name could be read several different ways. Maybe sortby_labels()?

seth-p commented 6 years ago

I'm not a fan of auto-flattening either, but that's what nd.argsort() does...

One option is to have DataArray.arg{min,max,sort}() all take an optional flag argument specifying whether to return integer indices or index labels. But I think my preference would be be to have six separate functions: DataArray.{idx,}arg{min,max,sort}() (or some such nomenclature that includes arg in all six functions).

stale[bot] commented 4 years ago

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

seth-p commented 4 years ago

I think this issue is still relevant.

anton-seaice commented 1 year ago

I agree that at least the index should show the reflect the order of the ranking found using argsort.


import xarray as xr
import numpy as np

np.random.rand(10)

ds = xr.DataArray(np.random.rand(10) )

ds.argsort().dim_0

returns array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

however it should return an array with the order of the elements ranked using argsort e.g.

array([3, 8, 5, 1, 7, 2, 9, 4, 6, 0])

Where ds[3] is the smallest value in ds and ds[0] is the biggest etc.

The current behaviour is confusing / nonsenical. Return the index unchanged implies that the initial array is already ordered.


INSTALLED VERSIONS
------------------
commit: None
python: 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:20:04) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-425.3.1.el8.nci.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.9.3
netCDF4: 1.6.0
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.4
cfgrib: None
iris: 3.3.1
bottleneck: None
dask: 2022.11.0
distributed: 2022.11.0
matplotlib: 3.4.3
cartopy: 0.21.1
seaborn: None
numbagg: None
fsspec: 2022.11.0
cupy: None
pint: None
sparse: 0.13.0
flox: None
numpy_groupies: None
setuptools: 65.5.1
pip: 23.0.1
conda: None
pytest: None
mypy: None
IPython: 8.6.0
sphinx: None