pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Interpolation behaviour inconsistent with numpy? #5664

Open mathisc opened 3 years ago

mathisc commented 3 years ago

Hey all, when running dataset.interp(time=dataset.time), the result is filled with np.nan whenever one of the neighbors is np.nan, even when no interpolation is actually needed.

Here is some sample code to reproduce the issue:

import numpy as np
import xarray as xr

def test_crop_times_nan():
    ds = xr.Dataset(
        data_vars={
            "some_variable": (["x", "time"], np.array([[np.nan, 0, 1]]))
        },
        coords={
            "time": np.array([0, 1, 2])
        },
    )
    result = ds.interp(time=ds.time)

    # result["some_variable"].values == [[nan, nan, 1.0]]
    # whereas [[nan, 0, 1.0]] is EXPECTED
    xr.testing.assert_allclose(ds, result)

Please note that numpy does not have the same behavior:

>>> import numpy as np
>>> np.interp([0,1,2], xp=[0,1,2], fp=[np.nan,0,1])
array([nan,  0.,  1.])

Is this the intended behaviour for xarray? If so, does this mean that I first have to check whether interpolation is needed, instead of doing it unconditionally (and use reindex instead of interp when it is not)?
(This will be tricky if interpolation is needed for some values but not others...)

Thanks for your help ;)

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.5 (default, Jul 28 2020, 12:59:40) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.8.0-7642-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4

xarray: 0.18.2
pandas: 1.2.4
numpy: 1.19.4
scipy: 1.6.0
netCDF4: 1.5.6
pydap: None
h5netcdf: 0.8.1
h5py: 3.1.0
Nio: None
zarr: None
cftime: 1.3.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.01.0
distributed: 2021.01.0
matplotlib: 3.4.2
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 57.4.0
pip: 20.2.4
conda: None
pytest: None
IPython: 7.19.0
sphinx: None
dcherian commented 3 years ago

Thanks @mathisc, we use scipy by default and that's what it does:

In [6]: from scipy.interpolate import interp1d

In [7]: f = interp1d([0, 1, 2], [np.nan, 0, 1])

In [8]: f([0, 1, 2])
Out[8]: array([nan, nan,  1.])

Can you file an issue over at scipy?

EDIT: I guess we could skip interpolating when output coordinate values are a subset of the input coordinate values.
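As a minimal sketch of that idea, a user-side helper (hypothetical, not part of the xarray API) could check whether every requested coordinate value already exists in the dataset and fall back to reindex in that case:

```python
import numpy as np
import xarray as xr

def interp_or_reindex(ds, **coords):
    """Use reindex when every requested coordinate value already
    exists along the corresponding dimension; otherwise interp."""
    if all(
        np.isin(np.asarray(vals), ds[dim].values).all()
        for dim, vals in coords.items()
    ):
        return ds.reindex(**coords)
    return ds.interp(**coords)

ds = xr.Dataset(
    {"some_variable": (["x", "time"], np.array([[np.nan, 0.0, 1.0]]))},
    coords={"time": [0, 1, 2]},
)

# All requested times are exact matches, so the reindex path is taken
# and the NaN at the edge is preserved instead of spreading.
result = interp_or_reindex(ds, time=ds.time)
```

This handles the all-exact-match case; mixed cases (some values needing real interpolation) would still go through interp and hit the NaN propagation.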

mathause commented 3 years ago

There does not seem to be a way to force the numpy interpolator for the 1D case:

https://github.com/pydata/xarray/blob/2f8623d4d9929b61ff4ebfd5232c12474420dfae/xarray/core/missing.py#L472

(that could be a workaround)
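One way to apply np.interp from user code (a sketch of a workaround via xr.apply_ufunc, not a supported interp mode) is to map it over the time dimension:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"some_variable": (["x", "time"], np.array([[np.nan, 0.0, 1.0]]))},
    coords={"time": [0, 1, 2]},
)
new_times = np.array([0, 1, 2])

# np.interp returns the exact fp value at exact xp matches, so NaN only
# propagates where a genuine interpolation uses a NaN neighbor.
result = xr.apply_ufunc(
    lambda y: np.interp(new_times, ds["time"].values, y),
    ds["some_variable"],
    input_core_dims=[["time"]],
    output_core_dims=[["time"]],
    exclude_dims={"time"},  # output time axis may differ from the input one
    vectorize=True,
).assign_coords(time=new_times)
```

Note np.interp is linear-only and does not extrapolate by default, so this is narrower than what ds.interp offers.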

mathisc commented 3 years ago

Thanks, Here is the ticket on Scipy's bug tracker : https://github.com/scipy/scipy/issues/14531

FYI, this is what I ended up doing (which feels like doing the work twice, whether interpolation is needed or not):

interpolated_ds = ds.interp(time=ds.time)
reindexed_ds = ds.reindex(time=ds.time)
ds = interpolated_ds.fillna(reindexed_ds)

It is probably not the most efficient in terms of operations, but it gets the job done.
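Filled out as a self-contained example (using the same dataset as the original report), that interp/reindex/fillna combination looks like:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"some_variable": (["x", "time"], np.array([[np.nan, 0.0, 1.0]]))},
    coords={"time": [0, 1, 2]},
)

# interp may introduce spurious NaNs next to existing ones...
interpolated_ds = ds.interp(time=ds.time)
# ...while reindex keeps exact matches untouched but cannot interpolate.
reindexed_ds = ds.reindex(time=ds.time)
# fillna patches the spurious NaNs with the exact-match values.
combined = interpolated_ds.fillna(reindexed_ds)
```

The cost is two passes over the data plus the fillna merge, which is why it feels like doing the work twice.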

Thanks for your feedback ;)

scottyhq commented 7 months ago

Interestingly, this was fixed upstream according to the linked scipy issue above, but the behavior persists in Xarray:

ds.interp(time=ds.time)
# array([[nan, nan,  1.]]) # (unexpected)

ds.interpolate_na(time=ds.time)
# array([nan,  0.,  1.])

from scipy.interpolate import interp1d
f = interp1d([0, 1, 2], [np.nan, 0, 1])
f([0, 1, 2])  
# array([nan,  0.,  1.])
xarray: 2024.2.0
pandas: 2.2.1
numpy: 1.26.3
scipy: 1.12.0