rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] After replace [-np.inf, np.inf] with np.nan, group forward fill not working. #16136

Open edwardluohao opened 3 months ago

edwardluohao commented 3 months ago

Describe the bug

There is an inconsistency in cudf's forward-fill behavior when np.inf and -np.inf are replaced with np.nan using a list. The same operation works correctly with pandas, or when np.inf and -np.inf are replaced separately.

Steps/Code to reproduce bug

import cudf
import numpy as np

data = {
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value': [1, -np.inf, 3, np.inf, 5, np.inf]
}

df = cudf.DataFrame(data)

print("Original DataFrame:")
print(df)

df['value'] = df['value'].replace([-np.inf, np.inf], np.nan)
df['value'] = df.groupby('group')['value'].ffill()

print("\nDataFrame after forward fill:")
print(df)

Output

DataFrame after forward fill:
  group  value
0     A    1.0
1     A    NaN
2     A    3.0
3     B    NaN
4     B    5.0
5     B    NaN

Expected behavior

DataFrame after forward fill:
  group  value
0     A    1.0
1     A    1.0
2     A    3.0
3     B   <NA>
4     B    5.0
5     B    5.0

Environment overview (please complete the following information)

It works fine if the two replacements are done separately:

df['value'] = df['value'].replace(-np.inf, np.nan)
df['value'] = df['value'].replace(np.inf, np.nan)

It also works if pandas is used instead of cudf.
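
For comparison, here is a minimal sketch of the same steps in pandas, where NaN counts as missing and the group-wise forward fill gives the expected result. It assumes only that pandas and numpy are installed.

```python
import numpy as np
import pandas as pd

data = {
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value': [1, -np.inf, 3, np.inf, 5, np.inf]
}

pdf = pd.DataFrame(data)

# Same list-based replace as in the cudf reproducer.
pdf['value'] = pdf['value'].replace([-np.inf, np.inf], np.nan)

# pandas treats NaN as missing, so ffill fills within each group.
pdf['value'] = pdf.groupby('group')['value'].ffill()

print(pdf)
# value column: 1.0, 1.0, 3.0, NaN, 5.0, 5.0
# (row 3 stays missing because group B has no earlier value to carry forward)
```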

wence- commented 3 months ago

The problem appears already after the replace call:

import cudf
import numpy as np

s = cudf.Series([1, -np.inf, np.inf])

print(s.replace([-np.inf, np.inf], np.nan))

print(s.replace(-np.inf, np.nan).replace(np.inf, np.nan))

The former produces:

0    1.0
1    NaN
2    NaN
dtype: float64

The latter:

0     1.0
1    <NA>
2    <NA>
dtype: float64

groupby.ffill handles the latter case, but not the former, in the way you might expect from pandas (where NaN is considered a missing value).

I agree that replace should produce the same output for the two examples in this comment (I think the latter is "more correct").

To work around this, if you replace your usage of np.nan in your replace call with None, then everything works as anticipated.

Note that this is a consequence of cudf being slightly stricter than pandas in a number of places when it comes to the difference between nan and NA: the latter indicates an actually missing value; the former (in cudf) does not.
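
To illustrate the distinction, here is a toy pure-Python model of a column as a list of values plus a validity mask. This is only a sketch of the idea and not how cudf is implemented; the function name ffill_nulls and the mask representation are made up for illustration.

```python
import math


def ffill_nulls(values, valid):
    """Forward-fill only the slots whose validity flag is False.

    A NaN with valid=True is an ordinary floating-point value here,
    so it is left untouched (hypothetical model, not cudf's code).
    """
    out_values, out_valid = [], []
    last = None        # most recent valid value seen so far
    have_last = False  # whether any valid value has been seen yet
    for v, ok in zip(values, valid):
        if ok:
            last, have_last = v, True
            out_values.append(v)
            out_valid.append(True)
        elif have_last:
            # Null slot with an earlier value available: carry it forward.
            out_values.append(last)
            out_valid.append(True)
        else:
            # Leading null: nothing to carry forward, stays null.
            out_values.append(None)
            out_valid.append(False)
    return out_values, out_valid


nan = float("nan")

# NaN stored as a valid value: nothing is filled.
nan_case = ffill_nulls([1.0, nan, 3.0], [True, True, True])

# Missing entry marked in the mask: it is filled from the previous row.
null_case = ffill_nulls([1.0, None, 3.0], [True, False, True])

print(nan_case[0])   # the NaN in position 1 is still there
print(null_case[0])  # [1.0, 1.0, 3.0]
```

In this model the NaN column passes through unchanged, which mirrors the first replace output above, while the masked column is filled, which mirrors the second.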