Open mattf opened 1 year ago
I'm considering closing this as an issue where we do not want to match pandas.
We support these correctly:
df.query("(D > 5) & (E > 5)")
, with parentheses
df.query("D > 5 and E > 5")
, with "and"
df.query("D > 5 & E > 5")
, IMO should not be considered as valid...
This issue here is, I think, similar to what we document for eval
where the meaning of the operators is type dependent. In this case, considering the following:
In [114]: np.bitwise_and(df["D"], df["E"])
Out[114]:
0 0
1 0
2 0
3 1
4 2
5 4
6 0
7 0
8 1
dtype: int64
In [115]: df.query("(D & E) > 1")
Out[115]:
A B C D E
4 foo two small 3 6
5 bar one large 4 6
Here &
is clearly being treated as a bitwise operator because the logical and would be empty (it can only return 0 or 1, so no rows would be >1). However, Python operator precedence rules clearly state that &
is higher precedence than >
, which in turn is higher precedence than and
. By that logic,
df.query("D > 5 & E > 5")
Should be interpreted as
df.query("D > (5 & E) > 5")
which is obviously nonsensical. The only way to produce the observed pandas result is if this is being interpreted as
df.query("(D > 5) and (E > 5)")
I would consider both this and the behavior of eval in this kind of situation to be bugs in pandas, which should treat these Python operators as they would be treated in pure Python code and follow the expected conventions.
@mroeschke WDYT?
In the pandas docs, it does state that eval
(which executes query
) makes bitwise and boolean operators have the same precedence with it's default "pandas"
engine. There does exist a "python"
engine that should make these operators have the expected language precedence rules
https://pandas.pydata.org/docs/user_guide/enhancingperf.html#pandas-eval-parsers https://pandas.pydata.org/docs/reference/api/pandas.eval.html#pandas.eval
Oh good call, I didn't consider the behavior of the different engines. This distinction is potentially problematic for cudf.pandas users. What we should consider doing is adding support for the engine keyword in cudf, but have the default value be "python" when using cudf directly and have it be "pandas" when cudf.pandas is active. We should raise an exception if the mode is anything other than "python". That way, cudf's default behavior is to match the python engine and it works out of the box, whereas with cudf.pandas (perhaps whenever the pandas-compatible option is set, depending on how we configure this; we could also monkey-patch) the default behavior is to use the nonexistent "pandas" engine, which triggers an exception and then falls back to pandas. That way users only get the speedup if they specify a python engine. It's a bit unfortunate since I imagine that most users don't specify an engine, but the only immediate alternative is giving users unexpected results. The other option is to actually implement the other semantics, of course, but I don't know if we can prioritize that anytime soon.
Describe the bug using
import cudf as pd
Steps/Code to reproduce bug