rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.09k stars 874 forks source link

[BUG] DataFrame.query binary-ops do not match pandas #12568

Open mattf opened 1 year ago

mattf commented 1 year ago

Describe the bug using import cudf as pd

Steps/Code to reproduce bug

In [1]: import cudf as pd

In [2]: pd.__version__
Out[2]: '22.12.00'

In [3]: df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
   ...:                          "bar", "bar", "bar", "bar"],
   ...:                    "B": ["one", "one", "one", "two", "two",
   ...:                          "one", "one", "two", "two"],
   ...:                    "C": ["small", "large", "large", "small",
   ...:                          "small", "large", "small", "small",
   ...:                          "large"],
   ...:                    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
   ...:                    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})

In [4]: df.query("D > 5 & E > 5")
Out[4]: 
Empty DataFrame
Columns: [A, B, C, D, E]
Index: []

In [5]: df.to_pandas().query("D > 5 & E > 5")
Out[5]: 
     A    B      C  D  E
7  bar  two  small  6  9
8  bar  two  large  7  9

In [6]: df.query("D > 5 | E > 5")
Out[6]: 
Empty DataFrame
Columns: [A, B, C, D, E]
Index: []

In [7]: df.to_pandas().query("D > 5 | E > 5")
Out[7]: 
     A    B      C  D  E
4  foo  two  small  3  6
5  bar  one  large  4  6
6  bar  one  small  5  8
7  bar  two  small  6  9
8  bar  two  large  7  9
GregoryKimball commented 1 year ago

I'm considering closing this as an issue where we do not want to match pandas.

We support these correctly: df.query("(D > 5) & (E > 5)"), with parentheses df.query("D > 5 and E > 5"), with "and" df.query("D > 5 & E > 5"), IMO should not be considered as valid...

vyasr commented 2 months ago

This issue here is, I think, similar to what we document for eval where the meaning of the operators is type dependent. In this case, considering the following:

In [114]: np.bitwise_and(df["D"], df["E"])
Out[114]:
0    0
1    0
2    0
3    1
4    2
5    4
6    0
7    0
8    1
dtype: int64

In [115]: df.query("(D & E) > 1")
Out[115]:
     A    B      C  D  E
4  foo  two  small  3  6
5  bar  one  large  4  6

Here & is clearly being treated as a bitwise operator because the logical and would be empty (it can only return 0 or 1, so no rows would be >1). However, Python operator precedence rules clearly state that & is higher precedence than >, which in turn is higher precedence than and. By that logic,

df.query("D > 5 & E > 5")

Should be interpreted as

df.query("D > (5 & E) > 5")

which is obviously nonsensical. The only way to produce the observed pandas result is if this is being interpreted as

df.query("(D > 5) and (E > 5)")

I would consider both this and the behavior of eval in this kind of situation to be bugs in pandas, which should treat these Python operators as they would be treated in pure Python code and follow the expected conventions.

@mroeschke WDYT?

mroeschke commented 2 months ago

In the pandas docs, it does state that eval (which executes query) makes bitwise and boolean operators have the same precedence with it's default "pandas" engine. There does exist a "python" engine that should make these operators have the expected language precedence rules

https://pandas.pydata.org/docs/user_guide/enhancingperf.html#pandas-eval-parsers https://pandas.pydata.org/docs/reference/api/pandas.eval.html#pandas.eval

vyasr commented 2 months ago

Oh good call, I didn't consider the behavior of the different engines. This distinction is potentially problematic for cudf.pandas users. What we should consider doing is adding support for the engine keyword in cudf, but have the default value be "python" when using cudf directly and have it be "pandas" when cudf.pandas is active. We should raise an exception if the mode is anything other than "python". That way, cudf's default behavior is to match the python engine and it works out of the box, whereas with cudf.pandas (perhaps whenever the pandas-compatible option is set, depending on how we configure this; we could also monkey-patch) the default behavior is to use the nonexistent "pandas" engine, which triggers an exception and then falls back to pandas. That way users only get the speedup if they specify a python engine. It's a bit unfortunate since I imagine that most users don't specify an engine, but the only immediate alternative is giving users unexpected results. The other option is to actually implement the other semantics, of course, but I don't know if we can prioritize that anytime soon.