pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

pandas.DataFrame.interpolate() does not extrapolate even if it is asked to, depending on interpolation method #31949

Open typorian opened 4 years ago

typorian commented 4 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

a = pd.Series([0, 1, np.nan, 3, 4, np.nan, np.nan, np.nan, np.nan])
a_int = a.interpolate(method='cubic', limit_area=None)

Problem description

Some of the offered methods (apparently all of those provided by interp1d) do not extrapolate over np.nan. However, the limit_area switch of df.interpolate() suggests that extrapolation can be forced. A combination of limit_area=None and a method that cannot extrapolate should at least raise a warning.
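A minimal sketch of the kind of warning suggested above. The wrapper name interpolate_with_warning is hypothetical and not part of pandas; it only checks, after the fact, whether NaNs outside the range of valid values survived the call. To stay independent of scipy and of version-specific behaviour, the demonstration uses a leading NaN with the default forward direction, which pandas documents as left unfilled.

```python
import warnings

import numpy as np
import pandas as pd


def interpolate_with_warning(s, **kwargs):
    """Hypothetical wrapper: interpolate, then warn if edge NaNs were not filled."""
    result = s.interpolate(**kwargs)
    valid = s.notna().to_numpy()
    if valid.any():
        # Positions of the first and last valid values in the input
        first = valid.argmax()
        last = len(valid) - 1 - valid[::-1].argmax()
        # Mask of positions lying outside the valid range
        outside = np.ones(len(valid), dtype=bool)
        outside[first:last + 1] = False
        leftover = int(result.isna().to_numpy()[outside].sum())
        if leftover:
            warnings.warn(
                f"{leftover} NaN value(s) outside the valid range were not filled",
                UserWarning,
                stacklevel=2,
            )
    return result


a = pd.Series([np.nan, 1.0, np.nan, 3.0])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    filled = interpolate_with_warning(a, method="linear")
```

Here the interior gap is filled, the leading NaN is not, and one UserWarning is recorded in caught.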

There used to be a similar issue where extrapolation over trailing NaN was done unintentionally, so maybe the fix for that overdid it. https://github.com/pandas-dev/pandas/issues/8000

Expected Output

Extrapolation over the NaNs in the array is expected. Using a different method, such as pchip, achieves this.

Output of pd.show_versions()

[paste the output of ``pd.show_versions()`` here below this line]
INSTALLED VERSIONS
------------------
commit : None
python : 3.7.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None
pandas : 0.25.3 (also tested with 1.0.0)
numpy : 1.15.4
pytz : 2018.9
dateutil : 2.7.5
pip : 20.0.2
setuptools : 41.0.1
Cython : 0.29.15
pytest : None
hypothesis : None
sphinx : 1.8.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.3.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : 7.5.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.3
matplotlib : 3.0.3
numexpr : None
odfpy : None
openpyxl : 2.5.12
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.2.1
sqlalchemy : None
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None

fercook commented 4 years ago

I second this.

Also, even when it works, it doesn't. The implied meaning of "extrapolate" is that it will continue on the last available trend. However, the observed result is that the last value is repeated.

In:

a = pd.Series([0, 1, np.nan, 3, 4, np.nan, np.nan, np.nan, np.nan])
a.interpolate(method='linear', limit_area=None)

Out:

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    4.0
6    4.0
7    4.0
8    4.0

RealJTG commented 4 years ago

I also stumbled on this bug.

Also, the examples in the current documentation are confusing. Extrapolation is mentioned there as "fill NaNs outside valid values (extrapolate)" (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html):

df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
                   (np.nan, 2.0, np.nan, np.nan),
                   (2.0, 3.0, np.nan, 9.0),
                   (np.nan, 4.0, -4.0, 16.0)],
                  columns=list('abcd'))
df
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  NaN  2.0  NaN   NaN
2  2.0  3.0  NaN   9.0
3  NaN  4.0 -4.0  16.0

df.interpolate(method='linear', limit_direction='forward', axis=0)
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  1.0  2.0 -2.0   5.0
2  2.0  3.0 -3.0   9.0
3  2.0  4.0 -4.0  16.0

... but this is not a linear extrapolation

     a
0  0.0
1  NaN
2  2.0
3  NaN

     a
0  0.0
1  1.0
2  2.0
3  2.0

Dr-Irv commented 4 years ago

Based on discussions in #8000 it seems we need an argument to specify that extrapolation at the beginning and end of the series can be specified. Alternatively, the docs could reflect that such extrapolation is not provided by interpolate

flxai commented 3 years ago

@Dr-Irv Is this still on the project's roadmap?

Dr-Irv commented 3 years ago

> @Dr-Irv Is this still on the project's roadmap?

We are always open to a PR that would address this issue. For issues like this, we don't put them on a roadmap. We are just open to the community addressing them!

flxai commented 3 years ago

@Dr-Irv Thanks for the explanation. Unfortunately I am not familiar enough with Pandas to see myself making a real contribution. But here is the solution I am currently working with:

def extrapolate_linear(s):
    s = s.copy()
    # Indices of not-nan values
    idx_nn = s.index[~s.isna()]

    # At least two data points needed for trend analysis
    assert len(idx_nn) >= 2

    # Outermost indices
    idx_l = idx_nn[0]
    idx_r = idx_nn[-1]

    # Indices left and right of outermost values
    idx_ll = s.index[s.index < idx_l]
    idx_rr = s.index[s.index > idx_r]

    # Derivative of not-nan indices / values
    v = s[idx_nn].diff()

    # Left- and right-most derivative values (positional access: v.iloc[0] is NaN after diff())
    v_l = v.iloc[1]
    v_r = v.iloc[-1]
    f_l = idx_l - idx_nn[1]
    f_r = idx_nn[-2] - idx_r

    # Set values left / right of boundaries
    l_l = lambda idx: (idx_l - idx) / f_l * v_l + s[idx_l]
    l_r = lambda idx: (idx_r - idx) / f_r * v_r + s[idx_r]
    x_l = pd.Series(idx_ll).apply(l_l)
    x_l.index = idx_ll
    x_r = pd.Series(idx_rr).apply(l_r)
    x_r.index = idx_rr
    s[idx_ll] = x_l
    s[idx_rr] = x_r

    return s

Example usage is documented in this notebook. Hopefully this is inspiration enough for more competent people to make a PR.

EDIT: I've vectorized the function.
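For comparison, a vectorized variant of the same idea might look like the sketch below. It is not the notebook's code: the name extrapolate_linear_vectorized is made up for illustration, and it assumes a numeric, sorted index. As in the function above, only the leading and trailing NaNs are filled; interior NaNs are left alone.

```python
import numpy as np
import pandas as pd


def extrapolate_linear_vectorized(s):
    """Fill leading/trailing NaNs by continuing the outermost linear trend."""
    valid = s.dropna()
    if len(valid) < 2:
        raise ValueError("need at least two non-NaN points to define a trend")

    # Assumption: the index is numeric and sorted in increasing order
    x = s.index.to_numpy(dtype=float)
    xv = valid.index.to_numpy(dtype=float)
    yv = valid.to_numpy(dtype=float)

    # Slopes of the first and last segments between valid points
    slope_l = (yv[1] - yv[0]) / (xv[1] - xv[0])
    slope_r = (yv[-1] - yv[-2]) / (xv[-1] - xv[-2])

    out = s.to_numpy(dtype=float, copy=True)
    left = x < xv[0]    # positions before the first valid value
    right = x > xv[-1]  # positions after the last valid value
    out[left] = yv[0] + slope_l * (x[left] - xv[0])
    out[right] = yv[-1] + slope_r * (x[right] - xv[-1])
    return pd.Series(out, index=s.index, name=s.name)


s = pd.Series([np.nan, 1.0, 2.0, np.nan, 4.0, np.nan, np.nan])
result = extrapolate_linear_vectorized(s)
```

With this input the edges become 0.0 on the left and 5.0, 6.0 on the right, while the interior NaN at position 3 is kept, so a regular interpolate() call would still be needed for gaps inside the data.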

khaeru commented 3 years ago

> Based on discussions in #8000 it seems we need an argument to specify that extrapolation at the beginning and end of the series can be specified. Alternatively, the docs could reflect that such extrapolation is not provided by interpolate.

I don't think that is a correct interpretation. The pandas docs specify that **kwargs are "Keyword arguments to pass on to the interpolating function." They also link directly to the docs for one such interpolating function, scipy.interpolate.interp1d. These in turn specify that one of these keyword arguments is fill_value and:

  • If “extrapolate”, then points outside the data range will be extrapolated. New in version 0.17.0.

So the following should work, with the current method signature and no new argument:

import numpy as np
import pandas as pd

# A 1-D Series with missing external values
x = [0.5, 1, 2, 3, 20]
y = [np.nan, 1, 4, 9, np.nan]
s = pd.Series(y, index=x)

# Expected usage
kw = dict(method="quadratic", fill_value="extrapolate")
s.interpolate(**kw)

But this fails. A value is extrapolated on the 'forward' end of s, but no value is extrapolated for x=0.5 on the 'backward' end:

0.5       NaN
1.0       1.0
2.0       4.0
3.0       9.0
20.0    400.0
dtype: float64

The code simply does not do what is advertised by the docs, so this is clearly (IMO) a bug. (Also a surprising one, since scipy 0.17 was 5 years ago, and one would think that pandas' use of basic features in numpy and scipy was stable and tested.)

A slightly more compact workaround than @flxai's:

from scipy.interpolate import interp1d

# A mask indicating where `s` is not null
m = s.notna()

# Construct an interpolator from the non-null values
# NB 'kind' instead of 'method'!
kw = dict(kind="quadratic", fill_value="extrapolate")
f = interp1d(s[m].index, s[m], **kw) 

# Apply this to the indices of the nulls; reconstruct a series
s2 = pd.Series(f(s[~m].index), index=s[~m].index)

# Fill `s` using the values from `s2`
result = s.fillna(s2)

# Previous 3 statements combined:
result = s.fillna(
    pd.Series(
        interp1d(s[m].index, s[m], **kw)(s[~m].index),
        index=s[~m].index,
    )
)

This gives the expected result:

0.5       0.25
1.0       1.00
2.0       4.00
3.0       9.00
20.0    400.00
dtype: float64

lyndonchan commented 3 years ago

@khaeru thanks for your code - I think I found a more elegant solution. As you mentioned, the extrapolation only works in the forward direction, so you can just flip the series, apply another extrapolation, and flip it back again.

Using your example (just one extrapolation):

import numpy as np
import pandas as pd

# A 1-D Series with missing external values
x = [0.5, 1, 2, 3, 20]
y = [np.nan, 1, 4, 9, np.nan]
s = pd.Series(y, index=x)

# Expected usage
kw = dict(method="quadratic", fill_value="extrapolate")
s.interpolate(**kw)

gives us this:

0.5       NaN
1.0       1.0
2.0       4.0
3.0       9.0
20.0    400.0
dtype: float64

But with my solution:

s.interpolate(**kw).iloc[::-1].interpolate(**kw).iloc[::-1]

we get this:

0.5       0.25
1.0       1.00
2.0       4.00
3.0       9.00
20.0    400.00
dtype: float64

khaeru commented 3 years ago

Sure, that also works! I don't know pandas' internals, so I can't guess whether your workaround, mine, or some other would perform best on large(r) series. People should consider their particular use-cases, and check/test.

Since this is a bug, these are only workarounds. The real ‘solution’ will be a PR by someone who knows the internals well enough to make one.

zhihua-zheng commented 3 years ago

@khaeru @lyndonchan To extrapolate in both directions, use limit_direction="both", which is not obvious at all.

import numpy as np
import pandas as pd

# A 1-D Series with missing external values
x = [0.5, 1, 2, 3, 20]
y = [np.nan, 1, 4, 9, np.nan]
s = pd.Series(y, index=x)

# Expected usage
kw = dict(method="quadratic", fill_value="extrapolate", limit_direction="both")
s.interpolate(**kw)

This gives:

0.5       0.25
1.0       1.00
2.0       4.00
3.0       9.00
20.0    400.00
dtype: float64

valschmidt commented 1 year ago

Unfortunately, the method @zhihua-zheng provides does not work properly for method="time". Specifying the 'time' method results in the first non-NaN value being repeated to the beginning and the last non-NaN value being repeated to the end, similar to what was shown by @RealJTG.

Here's an example:

import datetime

import numpy as np
import pandas as pd

start=datetime.datetime.now()
dt = datetime.timedelta(1)

x = [start,
     start+dt,
     start+2*dt,
     start+3*dt,
     start+4*dt]
y = [np.nan, 2.0, np.nan, 3.0, np.nan]

s = pd.Series(y,index=x)
kw = dict(method='time',limit_direction='both',fill_value="extrapolate")
s2 = s.interpolate(**kw)
print(s2)

2023-07-30 03:10:46.806929    2.0
2023-07-31 03:10:46.806929    2.0
2023-08-01 03:10:46.806929    2.5
2023-08-02 03:10:46.806929    3.0
2023-08-03 03:10:46.806929    3.0
dtype: float64

kmuehlbauer commented 1 year ago

Late to the party here, but we've experienced similar issues over at xarray.

One very obvious problem is that fill_value is broken for all interpolation schemes which use numpy.interp:

https://github.com/pandas-dev/pandas/blob/f00efd0344bd4e22cc867e76c776cb88669e6cde/pandas/core/missing.py#L504-L510

Although numpy.interp doesn't have a notion of fill_value, it has the similarly usable kwargs left and right. Unfortunately, no kwargs are passed on to numpy.interp.

One solution to fix this inconsistent behaviour is to either allow left/right kwargs to be transported to numpy.interp in the above code or use fill_value to set left/right (xarray is doing something along these lines).
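To illustrate what left and right do, here is a small numpy-only sketch. It shows the behaviour of numpy.interp itself, not how pandas or xarray wire it up; the sample points are borrowed from the quadratic example earlier in the thread.

```python
import numpy as np

xp = np.array([1.0, 2.0, 3.0])  # x-positions of the known points
fp = np.array([1.0, 4.0, 9.0])  # known values at those positions
x = np.array([0.5, 1.5, 20.0])  # query points; 0.5 and 20.0 lie outside [1, 3]

# Default: points outside the data range get the nearest edge value
default = np.interp(x, xp, fp)

# With left/right set, the out-of-range points get the given values instead
masked = np.interp(x, xp, fp, left=np.nan, right=np.nan)
```

By default the out-of-range points receive fp[0] and fp[-1], which is the "repeat the edge value" behaviour reported above; passing left and right replaces those with a value of the caller's choosing. Neither variant performs a linear extrapolation.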

~The second issue is that a proper use of left/right would be limited to either left or right depending on limit_direction (default forward) which can't be deactivated.~ With limit_direction="both" this would then work as expected.

As the numpy.interp issue is not directly connected to this issue here, I've opened a new bug report #55144.

joooeey commented 1 month ago

I second what others have said: the current behaviour is very surprising.

import numpy as np
import pandas as pd

# A 1-D Series with missing external values
x = [0.5, 1, 1.5, 2, 2.5, 3, 20]
y = [np.nan, 1, np.nan, 4, np.nan, 9, np.nan]
s = pd.Series(y, index=x)

# Expected usage
print(s.interpolate(method="index", limit_direction="both"))

This yields an utter mess:

0.5     1.0
1.0     1.0
1.5     2.5
2.0     4.0
2.5     6.5
3.0     9.0
20.0    9.0
dtype: float64

How is this a mess? Well, it uses the linear interpolation that I specified for the interior points, but for extrapolation it falls back to bfill and ffill instead of linear extrapolation.

I can work around this by using the following instead:

print(s.interpolate(method="slinear", limit_direction="both", fill_value="extrapolate"))

with output:

0.5     -0.5
1.0      1.0
1.5      2.5
2.0      4.0
2.5      6.5
3.0      9.0
20.0    94.0
dtype: float64

but this workaround was hard to find.

Either s.interpolate(method="index", limit_direction="both") or s.interpolate(method="slinear", limit_direction="both") should be enough to get the correct output (the Series just above).

joooeey commented 1 month ago

Furthermore, the default of extrapolating forward (the limit_direction= kwarg) strikes me as totally arbitrary. IMHO no extrapolation should be the default, as that is the safest choice. The exceptions are method="ffill" and method="bfill", which imply an extrapolation direction, but as far as I can see those are being dropped in 3.0.0. That is a great idea, since ffill is already accessible via Series.ffill and Series.interpolate(method="zero").
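Until the default changes, extrapolation can be switched off explicitly with the existing limit_area="inside" option, which restricts filling to NaNs surrounded by valid values. A short sketch using the Series from the earlier comment:

```python
import numpy as np
import pandas as pd

# A 1-D Series with missing values at both ends and in the interior
x = [0.5, 1, 1.5, 2, 2.5, 3, 20]
y = [np.nan, 1, np.nan, 4, np.nan, 9, np.nan]
s = pd.Series(y, index=x)

# Only NaNs between valid values are filled; the edge NaNs are left untouched
inside_only = s.interpolate(
    method="index", limit_direction="both", limit_area="inside"
)
```

The interior gaps are interpolated on the index values, and the entries at 0.5 and 20.0 stay NaN.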

joooeey commented 1 month ago

Is there a chance to get the changes I suggest (do the right thing for extrapolation and don't extrapolate by default) into 3.0.0? A major release would be required to clean up the API.

Is a PR welcome? Has someone already been working on parts of my suggestions?

joooeey commented 1 month ago

I notice that the workaround I mentioned above doesn't work if there are duplicate x-values:

import numpy as np
import pandas as pd

# A 1-D Series with missing external values
x = [0.5, 2, 2, 2.5, 3, 20]
y = [np.nan, 1, 4, np.nan, 9, np.nan]
s = pd.Series(y, index=x)

# Expected usage
print(s.interpolate(method="index", limit_direction="both", fill_value="extrapolate"))

interpolates just fine but doesn't extrapolate:

0.5     1.0
2.0     1.0
2.0     4.0
2.5     6.5
3.0     9.0
20.0    9.0
dtype: float64

but delegating to scipy:

print(s.interpolate(method="slinear", limit_direction="both", fill_value="extrapolate"))

raises an error:

[...]
  File [censored]/python3.12/site-packages/scipy/interpolate/_bsplines.py:1385 in make_interp_spline
    raise ValueError("Expect x to not have duplicates")

ValueError: Expect x to not have duplicates
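One possible way around the duplicate error is to collapse duplicated x-values before delegating to scipy. The sketch below averages the y-values that share an x-value; that choice is an assumption made purely for illustration, it changes the data (the step at x=2 becomes a single point), and it may not suit every use case. It requires scipy to be installed.

```python
import numpy as np
import pandas as pd

# A 1-D Series with a duplicated x-value (2 appears twice)
x = [0.5, 2, 2, 2.5, 3, 20]
y = [np.nan, 1, 4, np.nan, 9, np.nan]
s = pd.Series(y, index=x)

# Assumption: duplicated x-values may be replaced by the mean of their y-values.
# Groups that contain only NaN stay NaN.
deduped = s.groupby(level=0).mean()

# With a unique index, the scipy-backed method no longer sees duplicates
result = deduped.interpolate(
    method="slinear", limit_direction="both", fill_value="extrapolate"
)
```

Note that the result is indexed by the unique x-values, so it has one row fewer than s.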