rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
7.99k stars 866 forks source link

[BUG] Treatment of logical and bitwise binops in `DataFrame.eval` does not match pandas #14517

Open wence- opened 7 months ago

wence- commented 7 months ago

Describe the bug

In pandas, eval treats a op b as always meaning the bitwise version op in {and, or, &, |, ^}. In cudf, due to the way we parse the expression without type information (and the dispatching scheme to the AST interpreter in libcudf), and and or mean "logical" and &, |, and ^ mean "bitwise". There's a final wrinkle that (like spark) for bools only, masked values are treated as False.

This can cause differences in the result between calling eval with pandas and with cudf. Although the docstring mentions these differences, when using cudf.pandas, we don't see the cudf docstring (only the pandas one).

Expected behavior

Eventually, we should match pandas. I think this should be done by running a type inference pass on the user-provided expression and rewriting to an appropriate combination of bitwise and logical operations. This would have the nice side-effect of also allowing mixed-type operands in eval expressions by cudf upcasting before passing off to libcudf.

In the short term, we should probably raise a NotImplementedError when running in pandas-compat mode if the expression contains logical/bitwise binops.

vyasr commented 7 months ago

Clarifying this statement:

There's a final wrinkle that (like spark) for bools only, masked values are treated as False.

this refers to the output of boolean column operations:

>>> pd.DataFrame({'a': [10], 'b': [True]}).eval('a and a')
0    10
Name: a, dtype: int64
>>> pd.DataFrame({'a': [10], 'b': [True]}).eval('a & a')
0    10
Name: a, dtype: int64
>>> pd.DataFrame({'a': [10], 'b': [True]}).eval('b and b')
0    True
Name: b, dtype: bool
>>> pd.DataFrame({'a': [10], 'b': [True]}).eval('b & b')
0    True
Name: b, dtype: bool

IOW the statement being true relies on bitwise operators between bools being defined as returning bools.

wence- commented 7 months ago

No, replace your a column with a null value. the behaviour is different if it is a masked bool column to when it is a masked any other kind of column

vyasr commented 7 months ago

Ah OK. In that case my comment is more relevant as a separate statement about what behavior we need to see.