Open seth-p opened 6 years ago
It seems like another reasonable choice would be for DataArray.argsort()
to keep its current value but drop coordinate labels along the sorting dimension. This would give a consistent result for use with broadcasting indexing (https://github.com/pydata/xarray/pull/1473).
I think this internal utility function is equivalent to your second argsort: https://github.com/pydata/xarray/blob/2949558b75a65404a500a237ec54834fd6946d07/xarray/core/nputils.py#L38-L55
In practice, we might use bottleneck.rankdata()
or nanrankdata()
:
https://kwgoodman.github.io/bottleneck-doc/reference.html#non-reduce-with-axis
I think that makes sense, though I don't quite understand what would go in its place. Another possibility -- perhaps a bad one -- is to permute the values in the sorted dimension so that they match the resulting values (i.e. something like result.coords[dim] = np.take(da.coords[dim].values, result.values, axis=axis)
).
Note that ndarray.argsort(axis=None)
sorts the flattened array, so the returned DataArray
should respect this
Alternative suggestion: have DataArray.argsort()
return an ndarray
filled with labels from the sorted dimension, i.e. something like:
class DataArray:
def argsort(self, **kwargs):
# TODO: update kwargs['axis'] based 'axis' and 'dim', and remove 'dim'
if kwargs['axis'] is None:
kwargs['axis'] = -1
return self.stack(dim=self.dims).argsort(**kwargs)
return np.take(self.coords[self.dims[kwargs['axis']].values, self.values.argsort(**kwargs))
BTW, I'm just thinking in terms of ndarray
s. Someone more knowledgeable than me may want to consider how to make it work intelligently with dask.
Note that ndarray.argsort(axis=None) sorts the flattened array, so the returned DataArray should respect this
I'm not a huge fan of auto-flattening for xarray, but I can see this logic.
Alternative suggestion: have DataArray.argsort() return an ndarray filled with labels from the sorted dimension
I would probably implement a new method for this, maybe idxsort
for symmetry with idxmax
(https://github.com/pydata/xarray/issues/60). Though that name could be read several different ways. Maybe sortby_labels()
?
I'm not a fan of auto-flattening either, but that's what nd.argsort()
does...
One option is to have DataArray.arg{min,max,sort}()
all take an optional flag argument specifying whether to return integer indices or index labels. But I think my preference would be be to have six separate functions: DataArray.{idx,}arg{min,max,sort}()
(or some such nomenclature that includes arg
in all six functions).
In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity
If this issue remains relevant, please comment here or remove the stale
label; otherwise it will be marked as closed automatically
I think this issue is still relevant.
I agree that at least the index should show the reflect the order of the ranking found using argsort.
import xarray as xr
import numpy as np
np.random.rand(10)
ds = xr.DataArray(np.random.rand(10) )
ds.argsort().dim_0
returns
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
however it should return an array with the order of the elements ranked using argsort e.g.
array([3, 8, 5, 1, 7, 2, 9, 4, 6, 0])
Where ds[3] is the smallest value in ds and ds[0] is the biggest etc.
The current behaviour is confusing / nonsenical. Return the index unchanged implies that the initial array is already ordered.
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:20:04) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-425.3.1.el8.nci.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1
xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.9.3
netCDF4: 1.6.0
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.4
cfgrib: None
iris: 3.3.1
bottleneck: None
dask: 2022.11.0
distributed: 2022.11.0
matplotlib: 3.4.3
cartopy: 0.21.1
seaborn: None
numbagg: None
fsspec: 2022.11.0
cupy: None
pint: None
sparse: 0.13.0
flox: None
numpy_groupies: None
setuptools: 65.5.1
pip: 23.0.1
conda: None
pytest: None
mypy: None
IPython: 8.6.0
sphinx: None
Originally posted to https://groups.google.com/forum/#!topic/xarray/wsxeiIPLhgM
DataArray.argsort()
appears to simply wrap the result ofDataArray.values.argsort()
in a same-shapeDataArray
. This is semantically nonsensical. If anything the index on the resultingargsort()
values should simply berange(len(da))
, but that adds little to the underlying numpy structure. And there's not much reason to haveda.argsort()
simply return the (raw) result ofda.values.argsort()
. So reallyDataArray.argsort()
should simply be deleted.On the other hand a new function
DataArray.rank()
that wrapsda.values.argsort().argsort()
(note repeated call tondarray.argsort()
) in the structure of the originalDataArray
would make sense, and perhaps even be useful... (Note that I'm not claiming that.argsort().argsort()
is the fastest way to calculate this, but it's probably good enough, at least for an initial implementation.)