pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.62k stars 1.09k forks source link

Notion of "distance" or "scale" for indexes and selection #5453

Open WardBrian opened 3 years ago

WardBrian commented 3 years ago

Is your feature request related to a problem? Please describe.

I've been using xarray with atmospheric data given on pressure levels. This data is best thought of in log(pressure) for computation, but it is stored and displayed as standard numbers. I would love if there was some way to have data.sel(lev=4, method='nearest') return the nearest data in log-space without having to have my index stored in the log space.

E.g, currently

>>> a = xr.DataArray(['a', 'b', 'c'], dims='lev', coords=[('lev', [0.1, 10, 100])])
>>> a.sel(lev=2, method='nearest')
<xarray.DataArray ()>
array('a', dtype='<U1')
Coordinates:
    lev      float64 0.1

but in the scientific sense underlying this data, the closer point was actually 'b' at 10, not 'a' at 0.1.

In general, one can imagine situations where the opposite is true (storing data in log-space for numerical accuracy, but wanting a concept of 'nearest' which is the standard linear sense), or a desire for arbitrary scaling.

Describe the solution you'd like The simplest solution I can imagine is to provide a preprocessor argument to the sel function which operates over numpy values and is used before the call to get_loc.

e.g.

>>> a.sel(lev=2, method='nearest', pre=np.log)
<xarray.DataArray ()>
array('b', dtype='<U1')
Coordinates:
    lev      float64 10.0

I believe this can be implemented by wrapping both index and label_value here with a call to the preprocess function (assuming the argument is only desired alongside the 'method' kwarg): https://github.com/pydata/xarray/blob/9daf9b13648c9a02bddee3640b80fe95ea1fff61/xarray/core/indexes.py#L224-L226

Describe alternatives you've considered I'm not sure how this would relate to the ongoing work on #1603, but one solution is to include a concept of the underlying number line within the index api. The end result is similar to the proposed implementation, but it would be stored with the index rather than passed to the sel method each time. This may keep the sel api simpler if this feature was only available for a special ScaledIndex class or something like that.

One version of this could also be used to set reasonable defaults when plotting, e.g. if a coordinate has a log numberline then it could set the x/yscale to 'log' by default when plotting over that coordinate.

WardBrian commented 3 months ago

It appears enough has been done in the explicit indexing refactor that this is now possible!

Using xarray 2024.6.0:

import xarray as xr
import numpy as np

# some example data
data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 100]})
ds = xr.Dataset(dict(foo=data, bar=("x", [1, 2]), baz=np.pi))
print(ds)

# We create a class which is a subclass of xarray.Index
# https://docs.xarray.dev/en/latest/internals/how-to-create-custom-index.html

from xarray.core.indexes import PandasIndex, IndexSelResult, normalize_label, get_indexer_nd

# However, the built-in PandasIndex is _almost_ what we want,
# so we subclass it to steal most of the functionality
class LogIndexer(PandasIndex):
  def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)

  # Except for .sel, where we can inject our custom logic!
  def sel(self, labels: dict, method=None, tolerance=None) -> IndexSelResult:
    # based on the code in https://github.com/pydata/xarray/blob/ce5130f39d780cdce87366ee657665f4a5d3051d/xarray/core/indexes.py#L745
    if method == "nearest_log":
      assert len(labels) == 1
      coord_name, label = next(iter(labels.items()))
      label_array = normalize_label(label, dtype=self.coord_dtype)
      indexer = get_indexer_nd(np.log(self.index), np.log(label_array), method="nearest", tolerance=tolerance)
      return IndexSelResult({self.dim: indexer})
    return super().sel(labels, method, tolerance)

# Now we need to make a dataset that uses this custom index
ds_log = ds.drop_indexes('x').set_xindex('x', LogIndexer)
print(ds_log)

# x has values 10 and 100, to make it easy to observe the differences here:
print("normal:\t", ds_log.sel(x=[10,20,30,40,50,60], method="nearest").x.data)
print("log:\t", ds_log.sel(x=[10,20,30,40,50,60], method="nearest_log").x.data)

# prints:
# normal:  [ 10  10  10  10  10 100]
# log:     [ 10  10  10 100 100 100]
dcherian commented 3 months ago

I think we'd happily bundle that in Xarray, seems common enough.

cc @benbovy

WardBrian commented 3 months ago

To hew closer to my original suggestion, one could also accept a pointwise transform in the constructor, and then nearest could just apply that to both the index coordinates and the requested positions.

E.g., the set_xindex call would look like set_xindex("foo", PandasIndex, scale=np.log). This is similar to a suggestion given here: https://github.com/pydata/xarray/issues/7964#issuecomment-1642062405

This is more general (e.g. you could also do something like np.exp), but you lose the ability to use both nearest and nearest_log at the same time. For my own usages, either choice would be fine

benbovy commented 3 months ago

Yes this would be nice to have this functionality built-in Xarray.

Another option could be to have some TransformIndex meta-index that would basically wrap any Xarray index and apply an (inverse) user-defined transformation on label data before (after) executing the logic in the wrapped index. I.e., something very much like the unit-aware PintIndex implemented in https://github.com/xarray-contrib/pint-xarray/pull/163.

This is even more general and could be useful also for indexes implementing alignment with some tolerance (e.g., #4489), although that might seem a little far-fetched.

you lose the ability to use both nearest and nearest_log at the same time.

It is still possible to duplicate the coordinate so each have an index working on a different scale, e.g., "foo" and "foo_log").

WardBrian commented 3 months ago

I'd be happy to open a PR on either approach if there's a clear favorite

Edit: I may be misunderstanding, but it's not clear to me that a meta-index would work here, since the transformation also needs to be applied to the coordinate inside the index, not just the labels coming in to the sel call.

benbovy commented 3 months ago

Yes you are right, I didn't look carefully enough at your example and top comment.

WardBrian commented 3 months ago

Given that:

  1. Is there still interest in bundling the nearest_log functionality into xarray?
  2. Would it be preferred to be done as part of the implementation of PandasIndex, or as a wrapper class as above?
  3. Would it be preferred to implement this in such a way where the "log" nature is baked in, or would something like the set_xindex("foo", PandasIndex, scale=np.log) be preferred?
benbovy commented 3 months ago

It depends on whether we want to have this feature for free also in all indexes built on top of PandasIndex (most index types I've seen so far are either subclassing or encapsulating PandasIndex) vs. we want to keep PandasIndex as simple as possible (i.e., consistent with pandas and no extra option).

From a broader perspective it seems that the flexible distance / scale discussed here might be useful for both selection and alignment so maybe it is preferable to address it outside of PandasIndex? Something like a NearestIndex class in which we could later implement other "nearest" specific functionality like #4489 and add index options for tolerance, scale transform, etc.

I'm curious what others think, though!