df.query() doesn't follow python truthy-ness?

pandas-dev / pandas



df.query() doesn't follow python truthy-ness? #8560

Open kay1793 opened 10 years ago

kay1793 commented 10 years ago

In [44]: df=pd.DataFrame([[0,10],[1,20]],columns=['cat','count'])
    ...: df
Out[44]: 
   cat  count
0    0     10
1    1     20

In [45]: df.query('cat')
Out[45]: 
   cat  count
0    0     10
1    1     20

Expected the first row, where cat==0, to be dropped since 0 is falsy.

This unhelpful exception is how I stumbled over this:

In [46]: df.query('cat & count > 10')
NotImplementedError: couldn't find matching opcode for 'and_blb'
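
For readers who want the filtering the report expected (dropping rows where cat is 0), a minimal sketch is to make the boolean condition explicit instead of relying on truthiness. This only uses standard pandas calls on the DataFrame from the report; the variable names are illustrative, not part of the issue.

```python
import pandas as pd

df = pd.DataFrame([[0, 10], [1, 20]], columns=["cat", "count"])

# Spell the condition out so the expression evaluates to a boolean Series.
nonzero_via_query = df.query("cat != 0")

# The same filter with an explicit cast to bool and ordinary boolean indexing.
nonzero_via_mask = df[df["cat"].astype(bool)]

# Combining with a second condition, with every operand already boolean.
combined = df.query("cat != 0 and count > 10")
```

Because every operand is boolean before the conditions are combined, this form does not depend on how query treats a bare integer column.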

TomAugspurger commented 10 years ago

Just to make sure you're aware:

In [3]: df.query('cat > 0 & count > 0')
Out[3]: 
   cat  count
1    1     20

And for comparison:

In [15]: df[df.cat]
Out[15]: 
   cat  count
0    0     10
1    1     20

which gives the same as df.query('cat')

I think there have been previous issues about how to handle these cases (though not with respect to query specifically).
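
As a point of comparison for the working query above, here is a sketch of the same filter written as a boolean mask. Indexing with a boolean Series is the case where df[...] and df.query(...) are expected to agree; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[0, 10], [1, 20]], columns=["cat", "count"])

# Each comparison yields a boolean Series. Parentheses are required here
# because & binds tighter than > in plain Python.
mask = (df["cat"] > 0) & (df["count"] > 0)
by_mask = df[mask]

# The query form, parenthesised so it reads the same under either precedence.
by_query = df.query("(cat > 0) & (count > 0)")
```

Both select only the row where cat is 1.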

kay1793 commented 10 years ago

Thanks Tom, yeah, I got the "fix".

The truthiness might be tricky, like you said, but I think you got lucky; it's not doing what you think:

In [8]: df=pd.DataFrame([[1,10],[0,20]],columns=['cat','count'])
   ...: df[df.cat]
Out[8]: 
   count  cat
0     10    1
1     20    0

In [9]: df.query('cat')
Out[9]: 
   cat  count
1    0     20
0    1     10

or similar

In [12]: df=pd.DataFrame([[0,10],[1,20],[2,np.nan]],columns=['cat','count'])
    ...: df[df.cat]
IndexError: indices are out-of-bounds

In [13]: df.query('cat')
Out[13]: 
   cat  count
0    0     10
1    1     20
2    2    NaN

Anyway, was matching df[expr] and df.query(expr) an explicit goal or promise? I'm not sure; it doesn't look like it.

I'm more convinced that the AST exception should not happen.

jreback commented 10 years ago

@kay1793 the error handling could be improved here. pull-request?

kay1793 commented 10 years ago

The docstring says query expects a boolean expression, but it doesn't complain or document what happens when the result of the expression is not a boolean, and it doesn't coerce the result into a boolean array.

Sorry jreback, I have too much on my hands currently to dive in the query code.

jreback commented 10 years ago

@kay1793

@cpcloud and I discussed this yesterday: your expression should raise, as it doesn't result in a boolean. So we'll fix this.
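
Until query itself raises, the check described here can be approximated in user code. The sketch below is an assumption, not the pandas implementation: strict_query is a hypothetical helper name, and it relies only on DataFrame.eval and pandas.api.types.is_bool_dtype.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype


def strict_query(df: pd.DataFrame, expr: str) -> pd.DataFrame:
    """Hypothetical helper: filter like query(), but reject non-boolean results."""
    result = df.eval(expr)
    # Only a boolean Series is accepted as a row mask; anything else raises
    # instead of being silently used as an indexer.
    if not isinstance(result, pd.Series) or not is_bool_dtype(result):
        dtype = getattr(result, "dtype", type(result).__name__)
        raise ValueError(
            f"expression {expr!r} evaluated to {dtype}, expected a boolean Series"
        )
    return df[result]


df = pd.DataFrame([[0, 10], [1, 20]], columns=["cat", "count"])
filtered = strict_query(df, "cat > 0")  # boolean result, so rows are filtered

try:
    strict_query(df, "cat")  # integer result, so the helper raises
    raised = False
except ValueError:
    raised = True
```

Raising rather than coercing follows the decision stated in this comment; coercing with astype(bool) would be the alternative design.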

kay1793 commented 10 years ago

Thanks @jreback! Also found #8568; sorry I can't help with the fixes now.

JonHannah commented 7 years ago

I just tripped over this. Glad this thread came up in Google :)

@jreback are you still looking for someone to help with this?

jreback commented 7 years ago

@JonHannah sure! any open issues that are interesting need help :>

JonHannah commented 7 years ago

Great - I'll try and take a look sometime in the next week

wesm commented 6 years ago

@JonHannah do you want to have a look at this?

JonHannah commented 6 years ago

Sorry - I don't have time at the moment 😞

a-y-khan commented 4 years ago

Started looking at this issue, and it looks like df.query()'s behavior has changed since this issue was created. Is this the intended behavior now?


In [1]: import pandas as pd
   ...: print(pd.__version__)
   ...: df=pd.DataFrame([[0,10],[1,20]],columns=['cat','count'])
   ...: display(df)
   ...: 
   ...: 
1.1.0.dev0+765.g7fa8ee728
   cat  count
0    0     10
1    1     20

In [2]: display(df.query('cat & count > 10'))
   cat  count
1    1     20

In [3]: display(df.query('cat > 0 & count > 10'))
   cat  count
1    1     20

rhshadrach commented 4 years ago

@a-y-khan I don't believe the behavior is currently correct. Your example does not touch upon the issue here; if you change the query to count >= 10, I think you would see two rows whereas the correct behavior is to only include one.

Update: Surprisingly, even changing the query to count >= 10 returns only a single row. In any case, I misread the issue; the agreed-upon correct behavior here is to raise, since cat is not boolean.

NumberPiOso commented 2 years ago

take