rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Propagate nulls through `isin` #7556

Open brandon-b-miller opened 3 years ago

brandon-b-miller commented 3 years ago

Is your feature request related to a problem? Please describe.

In pandas, isin checks whether the values of a Series or DataFrame are contained in some other container, such as a list or a DataFrame. In cudf, this currently doesn't handle nulls correctly. On branch-0.19, if the DataFrame or Series being checked contains an <NA>, that position comes back as False:

>>> import cudf
>>> values = cudf.Series([1,2,3])
>>> df = cudf.DataFrame({'a':[1,2,None]})
>>> df
      a
0     1
1     2
2  <NA>
>>> df.isin(values)
       a
0   True
1   True
2  False

Instead, we should get another <NA> in that position, as pandas does when using nullable dtypes:

>>> import pandas as pd
>>> values = pd.Series([1,2,3], dtype='Int64')
>>> df = pd.DataFrame({'a':pd.Series([1,2,None], dtype='Int64')})
>>> df
      a
0     1
1     2
2  <NA>
>>> df.isin(values)
      a
0  True
1  True
2  <NA>

The fillna that causes us to return False is being removed in PR https://github.com/rapidsai/cudf/pull/7490, but we'll still need to rework how we test this functionality so that it tests against nullable pandas types. It just so happens that non-nullable pandas types also give False here, which is why our results have lined up so far.
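For reference, the non-nullable pandas baseline mentioned above can be reproduced with a short sketch. It uses only pandas; the variable names are illustrative. With default dtypes the None becomes a NaN in a float64 column, and the comparison yields False rather than a null:

```python
import pandas as pd

# Non-nullable pandas: the None is stored as NaN in a float64 column.
df = pd.DataFrame({"a": [1, 2, None]})
values = pd.Series([1, 2, 3])

# DataFrame.isin with a Series matches on the index, row by row.
result = df.isin(values)

print(df["a"].dtype)         # float64
print(result["a"].tolist())  # [True, True, False]
```

This is the output that happened to agree with cudf's fillna-based result, so tests comparing against it would not notice missing null propagation.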

Describe the solution you'd like

We should get an <NA> everywhere the Series or DataFrame in question already has an <NA>, and our tests should be updated to reflect that.
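One way the updated test expectation might be built is sketched below. This is not the project's actual test code: `expected_isin` is a hypothetical helper, and it assumes the desired semantics are "null in, null out" for every position of the input.

```python
import pandas as pd


def expected_isin(series, values):
    """Hypothetical test helper: isin result with input nulls propagated."""
    # Plain membership test, converted to the nullable boolean dtype.
    found = series.isin(values).astype("boolean")
    # Wherever the input is null, replace the answer with <NA>.
    return found.mask(series.isna())


s = pd.Series([1, 2, None], dtype="Int64")
expected = expected_isin(s, [1, 2, 3])
print(expected)
```

A cudf result converted to a nullable pandas Series could then be compared against `expected` instead of against the raw pandas `isin` output.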

Describe alternatives you've considered

We could change it as part of PR https://github.com/rapidsai/cudf/pull/7490, but it would be somewhat tangential to the point of that PR.


github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

wence- commented 12 months ago

Amusingly, this only happens with DataFrame.isin. If you call isin on Series objects, pandas behaves like cudf.
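A minimal pandas-only sketch of the Series case described in this comment, using the same data as the original report (variable names are illustrative):

```python
import pandas as pd

# Nullable Int64 Series containing one <NA>.
s = pd.Series([1, 2, None], dtype="Int64")

# Series.isin reports False for the null position rather than <NA>.
series_result = s.isin([1, 2, 3])
print(series_result.tolist())  # [True, True, False]
```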