Open WardBrian opened 3 years ago
It appears enough has been done in the explicit indexing refactor that this is now possible!
Using xarray 2024.6.0:
import xarray as xr
import numpy as np
# some example data
data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 100]})
ds = xr.Dataset(dict(foo=data, bar=("x", [1, 2]), baz=np.pi))
print(ds)
# We create a class which is a subclass of xarray.Index
# https://docs.xarray.dev/en/latest/internals/how-to-create-custom-index.html
from xarray.core.indexes import PandasIndex, IndexSelResult, normalize_label, get_indexer_nd
# However, the built-in PandasIndex is _almost_ what we want,
# so we subclass it to steal most of the functionality
class LogIndexer(PandasIndex):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
# Except for .sel, where we can inject our custom logic!
def sel(self, labels: dict, method=None, tolerance=None) -> IndexSelResult:
# based on the code in https://github.com/pydata/xarray/blob/ce5130f39d780cdce87366ee657665f4a5d3051d/xarray/core/indexes.py#L745
if method == "nearest_log":
assert len(labels) == 1
coord_name, label = next(iter(labels.items()))
label_array = normalize_label(label, dtype=self.coord_dtype)
indexer = get_indexer_nd(np.log(self.index), np.log(label_array), method="nearest", tolerance=tolerance)
return IndexSelResult({self.dim: indexer})
return super().sel(labels, method, tolerance)
# Now we need to make a dataset that uses this custom index
ds_log = ds.drop_indexes('x').set_xindex('x', LogIndexer)
print(ds_log)
# x has values 10 and 100, to make it easy to observe the differences here:
print("normal:\t", ds_log.sel(x=[10,20,30,40,50,60], method="nearest").x.data)
print("log:\t", ds_log.sel(x=[10,20,30,40,50,60], method="nearest_log").x.data)
# prints:
# normal: [ 10 10 10 10 10 100]
# log: [ 10 10 10 100 100 100]
I think we'd happily bundle that in Xarray, seems common enough.
cc @benbovy
To hew closer to my original suggestion, one could also accept a pointwise transform in the constructor, and then nearest
could just apply that to both the index coordinates and the requested positions.
E.g., the set_xindex
call would look like set_xindex("foo", PandasIndex, scale=np.log)
. This is similar to a suggestion given here: https://github.com/pydata/xarray/issues/7964#issuecomment-1642062405
This is more general (e.g. you could also do something like np.exp), but you lose the ability to use both nearest
and nearest_log
at the same time. For my own usages, either choice would be fine
Yes this would be nice to have this functionality built-in Xarray.
Another option could be to have some TransformIndex
meta-index that would basically wrap any Xarray index and apply an (inverse) user-defined transformation on label data before (after) executing the logic in the wrapped index. I.e., something very much like the unit-aware PintIndex
implemented in https://github.com/xarray-contrib/pint-xarray/pull/163.
This is even more general and could be useful also for indexes implementing alignment with some tolerance (e.g., #4489), although that might seem a little far-fetched.
you lose the ability to use both nearest and nearest_log at the same time.
It is still possible to duplicate the coordinate so each have an index working on a different scale, e.g., "foo" and "foo_log").
I'd be happy to open a PR on either approach if there's a clear favorite
Edit: I may be misunderstanding, but it's not clear to me that a meta-index would work here, since the transformation also needs to be applied to the coordinate inside the index, not just the labels coming in to the sel call.
Yes you are right, I didn't look carefully enough at your example and top comment.
Given that:
nearest_log
functionality into xarray?PandasIndex
, or as a wrapper class as above?set_xindex("foo", PandasIndex, scale=np.log)
be preferred?It depends on whether we want to have this feature for free also in all indexes built on top of PandasIndex (most index types I've seen so far are either subclassing or encapsulating PandasIndex) vs. we want to keep PandasIndex as simple as possible (i.e., consistent with pandas and no extra option).
From a broader perspective it seems that the flexible distance / scale discussed here might be useful for both selection and alignment so maybe it is preferable to address it outside of PandasIndex? Something like a NearestIndex
class in which we could later implement other "nearest" specific functionality like #4489 and add index options for tolerance, scale transform, etc.
I'm curious what others think, though!
Is your feature request related to a problem? Please describe.
I've been using xarray with atmospheric data given on pressure levels. This data is best thought of in log(pressure) for computation, but it is stored and displayed as standard numbers. I would love if there was some way to have
data.sel(lev=4, method='nearest')
return the nearest data in log-space without having to have my index stored in the log space.E.g, currently
but in the scientific sense underlying this data, the closer point was actually 'b' at 10, not 'a' at 0.1.
In general, one can imagine situations where the opposite is true (storing data in log-space for numerical accuracy, but wanting a concept of 'nearest' which is the standard linear sense), or a desire for arbitrary scaling.
Describe the solution you'd like The simplest solution I can imagine is to provide a preprocessor argument to the sel function which operates over numpy values and is used before the call to
get_loc
.e.g.
I believe this can be implemented by wrapping both index and label_value here with a call to the preprocess function (assuming the argument is only desired alongside the 'method' kwarg): https://github.com/pydata/xarray/blob/9daf9b13648c9a02bddee3640b80fe95ea1fff61/xarray/core/indexes.py#L224-L226
Describe alternatives you've considered I'm not sure how this would relate to the ongoing work on #1603, but one solution is to include a concept of the underlying number line within the index api. The end result is similar to the proposed implementation, but it would be stored with the index rather than passed to the
sel
method each time. This may keep thesel
api simpler if this feature was only available for a special ScaledIndex class or something like that.One version of this could also be used to set reasonable defaults when plotting, e.g. if a coordinate has a log numberline then it could set the x/yscale to 'log' by default when plotting over that coordinate.