Open edwardluohao opened 3 months ago
The problem appears already after the replace call:
import cudf
import numpy as np
s = cudf.Series([1, -np.inf, np.inf])
print(s.replace([-np.inf, np.inf], np.nan))
print(s.replace(-np.inf, np.nan).replace(np.inf, np.nan))
The former produces:
0 1.0
1 NaN
2 NaN
dtype: float64
The latter:
0 1.0
1 <NA>
2 <NA>
dtype: float64
groupby.ffill handles the latter case, but not the former, in the way you might expect from pandas (where NaN is considered a missing value).
I agree that replace should produce the same output for the two examples in this comment (I think the latter is "more correct").
To work around this, if you replace your usage of np.nan in your replace call with None, then everything works as anticipated.
Note that this is a consequence of cudf being slightly stricter than pandas in a number of places when it comes to the difference between nan and NA: the latter indicates an actually missing value, the former (in cudf) does not.
Describe the bug
There is an inconsistency in the forward fill behavior of cudf when np.inf and -np.inf values are replaced using a list. The same operation works correctly with pandas, or when np.inf and -np.inf are replaced separately.
Steps/Code to reproduce bug
Output
Expected behavior
DataFrame after forward fill:
  group value
0 A 1.0
1 A 1.0
2 A 3.0
3 B
4 B 5.0
5 B 5.0
Environment overview (please complete the following information)
It works fine if the replace is split into separate calls, one for -np.inf and one for np.inf, or if pandas is used instead.
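For comparison, here is a minimal pandas sketch of the list-based replace followed by a grouped forward fill. The input values are an assumption chosen for illustration, since the original reproduction code is not shown here; only the column names group and value come from the expected output above.

```python
import numpy as np
import pandas as pd

# Hypothetical input: the original reproduction data is not shown in this issue.
df = pd.DataFrame(
    {
        "group": ["A", "A", "A", "B", "B", "B"],
        "value": [1.0, np.inf, 3.0, -np.inf, 5.0, np.inf],
    }
)

# Replace both infinities in a single list-based call.
df["value"] = df["value"].replace([-np.inf, np.inf], np.nan)

# pandas treats NaN as missing, so the grouped forward fill picks it up.
df["value"] = df.groupby("group")["value"].ffill()

print(df)
```

With this input, the first row of group B has no earlier value in its group to fill from, so it stays missing, while the other replaced entries take the previous value within their group.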