Open wence- opened 7 months ago
Clarifying this statement:
There's a final wrinkle that (like spark) for bools only, masked values are treated as False.
this refers to the output of boolean column operations:
>>> pd.DataFrame({'a': [10], 'b': [True]}).eval('a and a')
0 10
Name: a, dtype: int64
>>> pd.DataFrame({'a': [10], 'b': [True]}).eval('a & a')
0 10
Name: a, dtype: int64
>>> pd.DataFrame({'a': [10], 'b': [True]}).eval('b and b')
0 True
Name: b, dtype: bool
>>> pd.DataFrame({'a': [10], 'b': [True]}).eval('b & b')
0 True
Name: b, dtype: bool
IOW the statement being true relies on bitwise operators between bools being defined as returning bools.
No, replace your a column with a null value. the behaviour is different if it is a masked bool column to when it is a masked any other kind of column
Ah OK. In that case my comment is more relevant as a separate statement about what behavior we need to see.
Describe the bug
In pandas,
eval
treatsa op b
as always meaning the bitwise versionop in {and, or, &, |, ^}
. In cudf, due to the way we parse the expression without type information (and the dispatching scheme to the AST interpreter in libcudf),and
andor
mean "logical" and&
,|
, and^
mean "bitwise". There's a final wrinkle that (like spark) for bools only, masked values are treated asFalse
.This can cause differences in the result between calling
eval
with pandas and with cudf. Although the docstring mentions these differences, when usingcudf.pandas
, we don't see the cudf docstring (only the pandas one).Expected behavior
Eventually, we should match pandas. I think this should be done by running a type inference pass on the user-provided expression and rewriting to an appropriate combination of bitwise and logical operations. This would have the nice side-effect of also allowing mixed-type operands in
eval
expressions by cudf upcasting before passing off to libcudf.In the short term, we should probably raise a
NotImplementedError
when running in pandas-compat mode if the expression contains logical/bitwise binops.