pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.8k stars 17.98k forks source link

BUG: In `main`, using `resample().interpolate(inplace=True)` raises an exception #58690

Open Dr-Irv opened 6 months ago

Dr-Irv commented 6 months ago

Pandas version checks

Reproducible Example

import numpy as np
from pandas import date_range, DataFrame
DR = date_range("1999-1-1", periods=365, freq="D")
DF_ = DataFrame(np.random.standard_normal((365, 1)), index=DR)
S = DF_.iloc[:, 0]
DF = DataFrame({"col1": S, "col2": S})

DF.resample("ME").interpolate(inplace=True)

Issue Description

Code raises an exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3\envs\pandasstubs\lib\site-packages\pandas\core\resample.py", line 958, in interpolate
    result_interpolated = result_interpolated.loc[final_index]
AttributeError: 'NoneType' object has no attribute 'loc'

Expected Behavior

No exception. Works fine in pandas 2.2.2

Note: Seems to be introduced by https://github.com/pandas-dev/pandas/pull/56515 by @cbpygit , review by @MarcoGorelli and @jbrockmendel

Installed Versions

INSTALLED VERSIONS ------------------ commit : 34177d6b20896bf101590306e3f0e1e0a225c5fb python : 3.9.16.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19045 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : English_United States.1252 pandas : 3.0.0.dev0+943.g34177d6b20 numpy : 1.26.4 pytz : 2024.1 dateutil : 2.9.0.post0 setuptools : 68.2.2 pip : 22.3.1 Cython : None pytest : 8.2.0 hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : 3.2.0 lxml.etree : 5.2.1 html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : 3.1.4 IPython : None pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : 4.12.3 bottleneck : None fastparquet : None fsspec : None gcsfs : None matplotlib : 3.8.4 numba : None numexpr : 2.10.0 odfpy : None openpyxl : 3.1.2 pyarrow : 16.0.0 pyreadstat : 1.2.7 python-calamine : None pyxlsb : 1.0.10 s3fs : None scipy : 1.13.0 sqlalchemy : 2.0.30 tables : 3.9.2 tabulate : 0.9.0 xarray : 2024.3.0 xlrd : 2.0.1 zstandard : 0.22.0 tzdata : 2024.1 qtpy : None pyqt5 : None
cbpygit commented 6 months ago

@Dr-Irv Thank you for reporting this problem! I am able to reproduce this. The problem is purely related to the inplace=True option. The interpolation result here becomes None and internally the _update_inplace method call here does not yet possess the final result. There is still the step of index correction missing, which is only applied afterwards here.

The problem is that the context in pandas.core.generic.NDFrame.interpolate does not know about the correct final index. So far, I am unable to figure out how I can update the state of the parent data frame that led to the pandas.core.resample.Resampler.interpolate call. I tried to manipulate self._selected_obj but it does not propagate. If somebody can point me at the object that needs to be updated I can easily fix this.

cbpygit commented 5 months ago

I could use some input here @MarcoGorelli. As we are introducing a breaking change anyway, we could think about removing the inplace option. I think it is counter-intuitive that the second method in a chain of methods modifies the original data frame "in place". Just think about the completely valid option to split the calls like this:

df = ...
resampler = df.resample("ME")   # creates a resampler instance, does not modify df
resampler.interpolate(inplace=True)  # modifies df, returns None
MarcoGorelli commented 5 months ago

we could think about removing the inplace option.

yup - this looks like one of the places where inplace was a lie anyway? Agree, "in for a penny, in for a pound" - if we gonna break, let's break. It's for the better anyway

cbpygit commented 5 months ago

@MarcoGorelli šŸ™Œ I'll take care of it and ping you in the PR.

cbpygit commented 5 months ago

@MarcoGorelli I looked into this but it is extremely convoluted šŸ˜¢ The inplace option is passed on in multiple steps, partially to overloaded methods or abstract methods with many implementations. There are dozens of tests that use inplace=True or even test this in particular.

Before I spend a lot of time on this, could you indicate a safe way to approach it? Shall I really drop the inplace argument in all those methods, delete all the tests against the feature and switch to not-inplace in the tests that use it?

MarcoGorelli commented 5 months ago

yeah i'd be ok with that

MarcoGorelli commented 5 months ago

as in, remove it from interpolate. if any intermediate function (which is also used in other places) requires inplace, just pass inplace=False

spawn-guy commented 2 months ago

@cbpygit is this a good place to ask about dtype downcasting deprecation FutureWarning? i see you are responsible for fixing .resample()...

.resample().interpolate() now produces FutureWarning: DataFrame.interpolate with object dtype is deprecated and will raise in a future version. Call obj.infer_objects(copy=False) before interpolating instead. but .resample() doesn't have infer_objects()

there is also an interesting question outlined in here that is think is related: https://github.com/pandas-dev/pandas/issues/53631#issuecomment-1853635791

should we continue in this issue?