rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] fillna with dataframe containing list dtypes fails #12274

Open VibhuJawa opened 1 year ago

VibhuJawa commented 1 year ago

Describe the bug

DataFrame.fillna fails with a TypeError when the DataFrame contains list-dtype columns and the fill value is another DataFrame.

Steps/Code to reproduce bug

import cudf
s1 = cudf.Series(['game', 'game', None,None,None])
s2 = cudf.Series([[1], [1], None,None,None])
df = cudf.DataFrame({'_TYPE_':s1, 'h':s2})

c1 = cudf.Series([None,None, 'user', 'user', 'user'])
c2 = cudf.Series([None, None, [2], [2], [2]])
subdf = cudf.DataFrame({'_TYPE_':c1, 'h':c2})
df.fillna(subdf)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [23], line 9
      7 c2 = cudf.Series([None, None, [2], [2], [2]])
      8 subdf = cudf.DataFrame({'_TYPE_':c1, 'h':c2})
----> 9 df.fillna(subdf)

File /datasets/vjawa/miniconda3/envs/cugraph_dgl_dev/lib/python3.9/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File /datasets/vjawa/miniconda3/envs/cugraph_dgl_dev/lib/python3.9/site-packages/cudf/core/indexed_frame.py:1842, in IndexedFrame.fillna(self, value, method, axis, inplace, limit)
   1837 @_cudf_nvtx_annotate
   1838 def fillna(
   1839     self, value=None, method=None, axis=None, inplace=False, limit=None
   1840 ):  # noqa: D102
   1841     old_index = self._index
-> 1842     ret = super().fillna(value, method, axis, inplace, limit)
   1843     if inplace:
   1844         self._index = old_index

File /datasets/vjawa/miniconda3/envs/cugraph_dgl_dev/lib/python3.9/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File /datasets/vjawa/miniconda3/envs/cugraph_dgl_dev/lib/python3.9/site-packages/cudf/core/frame.py:905, in Frame.fillna(self, value, method, axis, inplace, limit)
    899 should_fill = (
    900     col_name in value
    901     and col.contains_na_entries
    902     and not libcudf.scalar._is_null_host_scalar(replace_val)
    903 ) or method is not None
    904 if should_fill:
--> 905     filled_data[col_name] = col.fillna(replace_val, method)
    906 else:
    907     filled_data[col_name] = col.copy(deep=True)

File /datasets/vjawa/miniconda3/envs/cugraph_dgl_dev/lib/python3.9/site-packages/cudf/core/column/column.py:609, in ColumnBase.fillna(self, value, method, dtype)
    599 def fillna(
    600     self: T,
    601     value: Any = None,
    602     method: str = None,
    603     dtype: Dtype = None,
    604 ) -> T:
    605     """Fill null values with ``value``.
    606 
    607     Returns a copy with null filled.
    608     """
--> 609     return libcudf.replace.replace_nulls(
    610         input_col=self, replacement=value, method=method, dtype=dtype
    611     )

File replace.pyx:139, in cudf._lib.replace.replace_nulls()

File /datasets/vjawa/miniconda3/envs/cugraph_dgl_dev/lib/python3.9/site-packages/cudf/core/single_column_frame.py:95, in SingleColumnFrame.__bool__(self)
     94 def __bool__(self):
---> 95     raise TypeError(
     96         f"The truth value of a {type(self)} is ambiguous. Use "
     97         "a.empty, a.bool(), a.item(), a.any() or a.all()."
     98     )

TypeError: The truth value of a <class 'cudf.core.series.Series'> is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
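
The last two frames suggest that something inside replace_nulls evaluated a Series in a boolean context, which cudf rejects on purpose. Below is a minimal sketch of that mechanism in plain Python. FakeSeries and truthiness_check are hypothetical stand-ins, and the `if replacement:` pattern is an assumption: the traceback does not show the actual expression at replace.pyx:139.

```python
class FakeSeries:
    """Hypothetical stand-in for cudf.Series; it only models __bool__."""

    def __init__(self, values):
        self.values = list(values)

    def __bool__(self):
        # Same message as SingleColumnFrame.__bool__ in the traceback above.
        raise TypeError(
            f"The truth value of a {type(self)} is ambiguous. Use "
            "a.empty, a.bool(), a.item(), a.any() or a.all()."
        )


def truthiness_check(replacement):
    """Assumed pattern: any implicit bool() on the replacement raises."""
    try:
        if replacement:  # implicit bool(replacement)
            return "truthy"
        return "falsy"
    except TypeError as exc:
        return f"TypeError: {exc}"


# A plain Python list is fine in a boolean context; the stand-in Series is not.
result = truthiness_check(FakeSeries([None, None, [2]]))
print(truthiness_check([0]))
print(result)
```

If this reading is right, the failure is in how the replacement is dispatched for list columns rather than in the data itself.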

Expected behavior

I expect this to work the same way it does for non-list columns:

import cudf
s1 = cudf.Series(['game', 'game', None,None,None])
s2 = cudf.Series([1, 1, None,None,None])
df = cudf.DataFrame({'_TYPE_':s1, 'h':s2})

c1 = cudf.Series([None,None, 'user', 'user', 'user'])
c2 = cudf.Series([None, None, 2, 2, 2])
subdf = cudf.DataFrame({'_TYPE_':c1, 'h':c2})
df.fillna(subdf)
    _TYPE_  h
0   game    1
1   game    1
2   user    2
3   user    2
4   user    2
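
For reference, here is a host-side sketch of the semantics I would expect for list columns, written in plain Python with no cudf involved. The helper names fill_nulls_from_column and fill_nulls_with_scalar are hypothetical and only illustrate the expected result; they are not part of any library.

```python
def fill_nulls_from_column(values, other):
    """Position-wise fill: take other[i] wherever values[i] is None."""
    if len(values) != len(other):
        raise ValueError("columns must have the same length")
    return [o if v is None else v for v, o in zip(values, other)]


def fill_nulls_with_scalar(values, scalar):
    """Fill every None with a fresh copy of one list-valued scalar."""
    return [list(scalar) if v is None else v for v in values]


# The 'h' columns from the failing example above.
h = [[1], [1], None, None, None]
sub_h = [None, None, [2], [2], [2]]
filled = fill_nulls_from_column(h, sub_h)
print(filled)  # [[1], [1], [2], [2], [2]]

# Scalar fill, analogous to filling every null row with [0].
scalar_filled = fill_nulls_with_scalar([[1], None, [3], None], [0])
print(scalar_filled)  # [[1], [0], [3], [0]]
```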

Additional context

This affects the property graph in cuGraph, where we use fillna to replace null values:

https://github.com/rapidsai/cugraph/blob/cb0d0923616f656ec816f999aa633ecbf3c57267/python/cugraph/cugraph/structure/property_graph.py#L758

https://github.com/rapidsai/cugraph/blob/cb0d0923616f656ec816f999aa633ecbf3c57267/python/cugraph/cugraph/structure/property_graph.py#L1155

GregoryKimball commented 1 year ago

Thank you for posting this issue. fillna appears to lack proper support for list types.

FWIW here is a shorter repro:

import cudf
df = cudf.DataFrame({'a':[[1], [1], None,None,None]})
subdf = cudf.DataFrame({'a':[None, None, [2], [2], [2]]})
df.fillna(subdf)

eriknw commented 1 year ago

Bump.

xref: https://github.com/rapidsai/cugraph/issues/3010

Other examples using cudf.Series:

import cudf

s = cudf.Series([[1], None, [3], None])
t = cudf.Series([None, [2], None, [4]])

# These all fail
s.fillna(t)
s.fillna(cudf.Scalar([0]))
s.fillna([0])
s[s.isnull()] = t
s[s.isnull()] = t[s.isnull()]

## These work:
# s[s.isnull()] = [0]
# t[t.isnull()] = cudf.Scalar([0])

It would be straightforward to add the Python layer for this (probably add ListColumn.fillna), but it would still need a low-level implementation (perhaps in CUDA called from cpp/src/replace/nulls.cu:replace_nulls_column_kernel_forwarder).
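
To make the low-level part concrete, here is a sequential sketch of what replacing nulls in a list column involves, assuming an Arrow-style layout (an offsets array, a flat child array, and a per-row validity flag). This is plain Python for illustration only, not libcudf code; replace_nulls_list is a hypothetical name, and a GPU version would presumably compute the new offsets with a scan and gather the child elements in parallel.

```python
def replace_nulls_list(offsets, child, valid, r_offsets, r_child, r_valid):
    """Fill null rows of one list column from the same rows of another.

    Each column is (offsets, child, valid): row i holds
    child[offsets[i]:offsets[i + 1]] when valid[i] is True.
    Returns the (offsets, child, valid) triple of the filled column.
    """
    new_offsets, new_child, new_valid = [0], [], []
    for i in range(len(valid)):
        if valid[i]:
            # Row is present: keep the original elements.
            src_offsets, src_child, is_valid = offsets, child, True
        else:
            # Row is null: fall back to the replacement column.
            src_offsets, src_child, is_valid = r_offsets, r_child, r_valid[i]
        if is_valid:
            new_child.extend(src_child[src_offsets[i]:src_offsets[i + 1]])
        # A row that is null in both inputs stays null with zero length.
        new_offsets.append(len(new_child))
        new_valid.append(is_valid)
    return new_offsets, new_child, new_valid


# s = [[1], None, [3], None] and t = [None, [2], None, [4]] from above.
s_cols = ([0, 1, 1, 2, 2], [1, 3], [True, False, True, False])
t_cols = ([0, 0, 1, 1, 2], [2, 4], [False, True, False, True])
out_offsets, out_child, out_valid = replace_nulls_list(*s_cols, *t_cols)
print(out_offsets, out_child, out_valid)
```

The point of the sketch is that, unlike fixed-width columns, the output offsets cannot be copied from either input: they depend on which source each row comes from, so the child data has to be rebuilt.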