pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Add `area_limit` to `fillna` #60161

Open joshdunnlime opened 1 week ago

joshdunnlime commented 1 week ago

Feature Type

Problem Description

The pandas methods interpolate, ffill and bfill all have the limit_area option; fillna does not. It would be nice to add it.

Feature Description

DataFrame.fillna(value=None, *, method=None, axis=None, inplace=False, limit=None, limit_area=None, downcast=<no_default>)

Alternative Solutions

Interpolate with method='constant'. The obvious downside is that 'constant' is not a method in the SciPy interpolation API.

Additional Context

See https://github.com/pandas-dev/pandas/issues/56492 for this functionality added to ffill and bfill. It would be nice to have better API consistency between these methods and also interpolate.
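
As a reference point, here is a minimal sketch of what limit_area already does on interpolate, using a made-up Series (the data is an assumption chosen for illustration). The request is for fillna to honor the same "inside" / "outside" distinction when filling with a constant.

```python
import numpy as np
import pandas as pd

# Illustrative data: one leading NaN, one NaN between valid values, one trailing NaN.
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# "inside": only NaNs surrounded by valid values are filled.
inside = s.interpolate(limit_area="inside")

# "outside": only NaNs outside the valid values are filled. With the default
# forward direction, that reaches the trailing NaN but not the leading one.
outside = s.interpolate(limit_area="outside")

print(inside.tolist())
print(outside.tolist())
```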

rhshadrach commented 1 week ago

interpolate, ffill, and bfill all fill values using values near the given location. With the method argument being deprecated, fillna does not. It doesn't seem appropriate to have limit_area because fillna does not work with nearby values.

joshdunnlime commented 1 week ago

I get that it doesn't fill with nearby values but I don't see why it couldn't fill missing values with some knowledge of the rows around it? After all it has the limit kwarg.

What about my other suggestion of having a constant term on interpolate? (I still think it makes more sense to have limit_area on fillna, as it's a very valid use case, as shown by ffill etc.)
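
For context on the limit kwarg mentioned above, a small sketch with made-up data: when filling with a constant, limit caps how many NaNs get filled, counting from the start of the axis, so fillna already pays some attention to position.

```python
import numpy as np
import pandas as pd

# Illustrative data only.
s = pd.Series([np.nan, 1.0, np.nan, np.nan])

# Fill with a constant, but stop after the first NaN encountered.
limited = s.fillna(0.0, limit=1)

print(limited.tolist())
```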

rhshadrach commented 1 week ago

> I get that it doesn't fill with nearby values but I don't see why it couldn't fill missing values with some knowledge of the rows around it?

I'm not saying it couldn't. But this increases the scope of the function which I think is undesirable.

> What about my other suggestion of having a constant term on interpolate?

I don't think that fits the definition of "interpolate", so I find this undesirable from an API design perspective.

One can implement this behavior as follows:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, 1, np.nan, 2], "b": [1, np.nan, 2, np.nan]})
isna = df.isna()

# inside: NaNs that have a valid value both before and after them in the column
mask = isna & (~isna).cummax() & (~isna).loc[::-1].cummax()
print(df.mask(mask, 5.0))
#      a    b
# 0  NaN  1.0
# 1  1.0  5.0
# 2  5.0  2.0
# 3  2.0  NaN

# outside: NaNs in a leading or trailing run, with no valid value on one side
mask = isna & (isna.cummin() | isna.loc[::-1].cummin())
print(df.mask(mask, 5.0))
#      a    b
# 0  5.0  1.0
# 1  1.0  NaN
# 2  NaN  2.0
# 3  2.0  5.0

Perhaps this is too technical to be expected from users though. cc @pandas-dev/pandas-core for any thoughts.
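
If the masks above feel too technical to write by hand, one option is to wrap them in a small helper. The sketch below is only an illustration: the function name fillna_limit_area and its signature are hypothetical and not part of pandas. It uses the same cummax/cummin logic as the snippet above, but reverses with iloc and flips the result back so no index alignment is needed.

```python
import numpy as np
import pandas as pd


def fillna_limit_area(df, value, limit_area):
    """Hypothetical helper (not a pandas API): fill NaNs with a constant,
    restricted to the "inside" or "outside" area of each column."""
    isna = df.isna()
    if limit_area == "inside":
        notna = ~isna
        # True from the first valid value onward.
        after_first = notna.cummax()
        # True up to and including the last valid value.
        before_last = notna.iloc[::-1].cummax().iloc[::-1]
        mask = isna & after_first & before_last
    elif limit_area == "outside":
        # True only within the leading / trailing run of NaNs.
        leading = isna.cummin()
        trailing = isna.iloc[::-1].cummin().iloc[::-1]
        mask = isna & (leading | trailing)
    else:
        raise ValueError("limit_area must be 'inside' or 'outside'")
    return df.mask(mask, value)


df = pd.DataFrame({"a": [np.nan, 1, np.nan, 2], "b": [1, np.nan, 2, np.nan]})
filled_inside = fillna_limit_area(df, 5.0, "inside")
filled_outside = fillna_limit_area(df, 5.0, "outside")
print(filled_inside)
print(filled_outside)
```

On the example frame this should reproduce the two outputs shown above.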