Closed jreback closed 11 years ago
To follow on from your comment, shouldn't we be using `&` and `|`? I think this may also have the benefit of `all` and `any` just working. Also `~` for not/invert (since that would make it the same as numpy).
I haven't got my head around numexpr yet, so I may be talking complete nonsense. (I've moved Term to expressions without breaking things, and changed the repr to eval back to itself (was there a reason for it not to?).)
I agree about the operators (though I think you actually need to accept both); these are always going to be in a string expression in any event... because you need delayed evaluation.
but since we actually DO want the `&` etc., you can just replace them (e.g. this is really a user interface issue); we are not actually going to evaluate them
e.g.
df[(df > 0) and (df < 10)]
vs
df['(df > 0) and (df < 10)']
i'm sure everyone involved in this thread knows this but just wanted to point out that the precedence of `and` and `&` is different. if i was a first-time user i would think that `df > 0 and df < 10` and `df > 0 & df < 10` do the same thing, so if both are going to be supported i think precedence rules should be kept as close to python as possible, meaning parens are required for `&` but not for `and`.
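To make the precedence point concrete, here is a small numpy illustration (the variable names are mine, not from the thread):

```python
import numpy as np

a = np.array([1, 2, 3])

# `&` binds tighter than comparisons, so the unparenthesized form parses as
# a > (0 & a) < 3 -- a chained comparison, which is ambiguous on arrays
# and raises ValueError at evaluation time:
try:
    a > 0 & a < 3
except ValueError:
    pass  # "The truth value of an array ... is ambiguous"

# with parens, `&` combines two boolean masks elementwise, as intended
mask = (a > 0) & (a < 3)
```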
@cpcloud this is in a string context, so in theory you can make them the same (this is a big confusion I think in operating in pandas; I think people expect them to be the same (even though they are wrong))
@jreback sure. i was just semi-thinking-out-loud here, thought that it might warrant a discussion. this goes back to python core devs not wanting to provide the ability to overload `not`, `and`, and `or`, so numpy was forced to overload bitwise operators for boolean operations (there's a youtube video of a discussion about this with GvR; there's even a patch to core python that allows you to do this). i really wish that pep went through, sigh. i didn't realize there was a big confusion here, since this really has nothing to do with pandas, it's a language feature/bug. i was just thinking that adding more parsing rules to remember is annoying to users.
it's a valid point
the purpose of eval is to facilitate multi-expression parsing that we will evaluate in numexpr, so we have to have a string input (to avoid python evaluation of the sub-expressions). or maybe there is a way to disable this (like how %timeit works in ipython), but i think they are using an input hook and hence everything really is a string
@jreback u can do it with the `cmd` module too. i think ipython used to use that, or maybe they still do. i think only macros would allow you to do this without string literals. btw there is now a Python macros library. i haven't tried it out but it looks like fun. another possibility is to support numba as a method, although first things first (numexpr). do u already have something going for this?
@hayd said he was giving it a stab. Andy can u post a link to your branch?
@cpcloud `numba` is interesting, but the infrastructure requirement is high, and in any event, it's basically using numexpr under the hood :) (as well as `ctable` for storage)
@cpcloud I reread your question
the issue is this: `df[(df>0) & (df<10)]` is evaluated as 3 separate sub-expressions, plus a boolean mask selection, while `df.eval('(df>0) & (df<10)')` can be evaluated (after alignment) in a single numexpr pass (and then a boolean mask) to return the dataframe, so can be a massive speedup
that's the main reason for this function
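As a rough sketch of the mechanism (variable names are mine): numpy materializes each sub-expression as a temporary array, while numexpr evaluates the whole string in one fused pass:

```python
import numpy as np
import numexpr as ne

values = np.random.randn(100000)

# numpy: two temporary boolean arrays, then a third for the &
mask_np = (values > 0) & (values < 1)

# numexpr: the whole expression runs in a single pass over `values`,
# without materializing the intermediate masks
mask_ne = ne.evaluate('(values > 0) & (values < 1)')

assert (mask_np == mask_ne).all()
```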
@jreback that is pretty cool. i haven't done much with numexpr, i assumed that pandas uses it when it can...is that a fallacious assumption? should i be explicitly using numexpr?
it's used in pytables for query processing and in most evaluations now as of 0.11 (you need a fairly big frame for it to matter)
see the core/expressions module
I haven't done much so far, I've moved Term to expressions and added some helper functions for that class, nor have I really looked into numexpr yet.
I kind of lost my way on the road map... and may be totally confused atm.
Am I way off here?
1. move term to expression
so there are 3 goals here:

1) parser to turn:
'df[(df>0)&(df<10)]'
into this (call this the parsed_expr)
df[_And(Term('df','>',0),Term('df','<',10))]

2) take a parsed_expr, align the variables (e.g. perform stuff like what `combine_frame`, `combine_series`, `combine_scalar` do (e.g. the alignment/reindexing steps)), call this the aligned_expr

3) take aligned_expr and turn this into a numexpr expression (like what `Term` does and the `expressions` module does (though it's very simple)); this would be an expansion of `expressions` to take in aligned `Term`s with their boolean operators (e.g. `_And`/`_Or`/`_Not` and parens)

1) involves tokenizing/ast manipulation (kind of like `numexpr.evaluate` does) to form the `Term`s; I am not sure how tricky this is, so we were going to skip it for now

2) this is straightforward: take the parsed_expr and substitute variables that are aligned (keep frames as frames); don't need to expand scalars at all, mainly just reindex things that need it, and create the aligned_expr

3) this is straightforward too: just take the term expressions and generate the numexpr itself
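A minimal sketch of step 1, using Python's `ast` module; the `Term`/`_And` classes here are bare stand-ins for whatever the real ones end up looking like:

```python
import ast

class Term:
    def __init__(self, lhs, op, rhs):
        self.lhs, self.op, self.rhs = lhs, op, rhs
    def __repr__(self):
        return f"Term({self.lhs!r}, {self.op!r}, {self.rhs!r})"

class _And:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def __repr__(self):
        return f"_And({self.left!r}, {self.right!r})"

_OPS = {ast.Gt: '>', ast.Lt: '<', ast.GtE: '>=', ast.LtE: '<='}

def parse(node):
    """Turn a boolean expression AST into nested Term/_And objects."""
    if isinstance(node, ast.Expression):
        return parse(node.body)
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.BitAnd):
        return _And(parse(node.left), parse(node.right))
    if isinstance(node, ast.Compare):
        return Term(node.left.id, _OPS[type(node.ops[0])],
                    node.comparators[0].value)
    raise TypeError(f"unsupported node: {node!r}")

expr = parse(ast.parse('(df > 0) & (df < 10)', mode='eval'))
# repr(expr) -> "_And(Term('df', '>', 0), Term('df', '<', 10))"
```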
so I think `termset` is really `Term`, plus the boolean operators, and a grouping operator (the parens)
these just allow easy expression manip (your 2)
your 3 (skip for now, that's my 1)
your 4 is my 3
I don't think you need 5
@jreback i know u said skip 1 but i can do that if u want (lots of nice python builtins for dealing with python source) while @hayd does 2 and 3. what would be allowed to be parsed? `expr`s in the python grammar? or just boolean exprs? could start with booleans for now and extend after that is working...
the more the merrier!
let's start with the example
df.eval('(df>0)&(df<10)')
This is really about the masks as that's where all the work is done, but I think it would be nice eventually to do something on the rhs as well:
pd.eval('df[(df>0)&(df<10)] = df * 2 + 1', engine='numexpr')
so we can support `getitem` and `setitem` and pass both the lhs and rhs to the evaluator
(imagine engine='multi-process' or 'out-of-core')......
to heck with blaze! (well maybe `engine=blaze` is ok too)
I think I was worried that nested Terms wouldn't come for free with _And and _Or, but I'll put something together imminently and we can see whether it does. :)
We can just tell everyone it's blaze...
i've got it parsing nested and terms already :) albeit they are strings right now and only `&` (parsing `and` is different); i haven't written the `_And` class yet
@cpcloud I would just use the `&`, `|`, and `~` for now (to keep consistent); can always add later
@cpcloud the end goal is to create a `numexpr` expression (the functionality is in the `Selection` class in io/pytables.py); so the class that holds the parsed expression (the nested _And/_Or) should parse to this (and has to do type translation and such). also this class could do the alignment I think (which is the reason for having the parsed expression, so you can basically just iterate thru all of the terms and see what needs to be aligned)
e.g.
for t in term_expression:
    t.align()
Term align (pseudo codish)
def align(self):
    # self.lhs, self.op, self.rhs
    if self.lhs is a DataFrame:
        if self.rhs is a Series....
        is a Frame
        is a Scalar
    maybe return a new expression that is aligned
ah i see. so an `Expr` class should hold the ands and ors, which consist of terms (or nested expressions). `Expr` could have an align method which aligns and then passes to `numexpr`. is that correct?
I think you actually need 3 classes here:

1) `Term`, which holds lhs operator rhs (and prob a reference back to the top-level `Expr` for variable inference)

2) `Termset`, although maybe `Expr`, or maybe `Terms`, is better here (I mean a nested list of `_and`/`_or`/`_not` operators on the `Term`s)

3) Top-level, maybe `Expression`, which holds 2) the termset, and the engine and such
e.g.
pd.eval('df[(df>0)&(df<10)]')
yields

Expression():
    original string
    df[mask] (you need to keep this where)
    termset of the boolean expression
    engine
    maybe an environment pointer (this is like a closure) but we are not fancy here :)
    methods:
        parse (create the termset)
        align (have the termset align)
        convert_to_engine_format (return the converted termset)

Termset():
    _and(Term('df','>',0), Term('df','<',10))
    methods:
        align (maybe return a new termset that is aligned)
        convert_to_engine_format (return the converted-to-engine format; this would be a string)
lol gh doesn't like ur rst flavored monospace
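For what it's worth, the `convert_to_engine_format` step might look something like this (purely a sketch; the class shapes are my guesses, not the thread's code):

```python
class Term:
    """One comparison, e.g. df > 0."""
    def __init__(self, lhs, op, rhs):
        self.lhs, self.op, self.rhs = lhs, op, rhs
    def convert(self):
        # render this term as an engine-ready string fragment
        return f"({self.lhs} {self.op} {self.rhs})"

class _And:
    """A conjunction of terms (or nested expressions)."""
    def __init__(self, *terms):
        self.terms = terms
    def convert(self):
        # join the converted sub-expressions with the engine's & operator
        return '(' + ' & '.join(t.convert() for t in self.terms) + ')'

termset = _And(Term('df', '>', 0), Term('df', '<', 10))
engine_expr = termset.convert()  # "((df > 0) & (df < 10))"
```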
This was where I was up to: https://github.com/hayd/pandas/tree/term-refactor
possible engines right now are `'numexpr'` and `'pytables'`?
well.... the pytables target is the same, numexpr; the only difference is that the Terms need to do different alignment (as they are scalar type conditions, e.g. `index>20130523`, where index is a field in the table, and the date gets translated to `i8`); so you do need support for that (so yes you could use `engine=pytables` to handle that), but in pytables you need to have what I call the `queryables` dict passed in anyhow for validation (whereas in the case of a boolean expression you have the `df` passed in, or taken from `locals()`)
@jreback @hayd fyi for some reason `expressions.py` has dos line endings while, for example, frame.py does not. isn't git supposed to take care of this? it's pretty annoying and will cause a billion and one merge conflicts... it's just that file: i just ran `dos2unix` on all of pandas and that's the only thing that changed. i did this after a fresh clone
that was my fault; I had the wrong setting on git, which used the windows line endings (I changed my git so now it won't change them). I actually edit using xemacs on a pc even though I do everything on Linux, so u could change them if u want, no big deal
oh ok cool thanks
@jreback by align here you're referring to something slightly different from the align methods on frames and series, right? for example, the expression `df > 0` will result in a `Term('df', '>', [0])` object. aligning a list to a `df` could be done by conversion to `Series`, but then alignment will result in the original df and a `Series` with a bunch of `NaN`s except where the non-nan values were in the original list
the entire expression will be passed as numpy values. when I say alignment I mean anything that needs reindexing or some sort of conversion needs to be done
e.g. see the methods `combine_frame`, `combine_series`, `combine_scalar`
so in your example nothing needs to be aligned (as a scalar can be passed directly)
step thru what happens when you do
df1 + df2 on frames that overlap (but are not identical)
see what is passed to `expressions.evaluate`
that's the end result, so alignment needs to make the shapes match up (but for example scalars don't need any treatment)
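To illustrate what alignment buys here (a toy example, not the thread's code): two overlapping but non-identical frames have to be reindexed to a common shape before their raw numpy values can be combined:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1.0, 2.0]}, index=[0, 1])
df2 = pd.DataFrame({'a': [10.0, 20.0]}, index=[1, 2])

# align() reindexes both frames to the union of their labels,
# filling NaN where a label is missing from one side
left, right = df1.align(df2)

# only now do the shapes match, so the raw .values can go to an engine
result = left.values + right.values  # shape (3, 1)
```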
ah i c. totally clear now, thanks. i don't know why i didn't think of this before, but `align` will work here.
for sure
BUT you will need to apply it to the entire expression (it's meant for a binary argument now, self and other)
but I don't think u need it directly; really you should just use the combine_* functions as a guide
Right now align works recursively over binary operators and their operands until a Term is reached. I'm guessing that align might be inefficient in space for large frames there. Need to check.
ok remember there is only going to be a small number of ops/ terms (compared to the size of frames, in general)
Ah true. Very long expressions will be rare.
hm, how to deal with calls to `__getitem__` here? `numexpr` doesn't like these. i've got it evaling expressions without getitem calls; sadly i had to pretty much break all of the existing pytables functionality (the current `Term` class is not general enough)
you have to translate everything to numpy values. numexpr shouldn't see anything except numpy arrays, operators, and scalars
numexpr doesn't support indexing and it won't ever. i suppose i could do the slicing at the last minute and special case the parser for that
that's the point of the alignment step: you need to create a string expression with only numpy arrays, operators, and scalars. broadcasting is ok (e.g. let numpy do it), but nothing else. you may need to do dtype conversions (maybe)
ok. then for now while i'm working on this i'll ignore things like `df[index1][index2]` or attribute indexers and such.
well, as long as df -> df.values and index1 and index2 were converted, it could be ok. in any event we just want to start with a subset of operations. u can also bail out and just evaluate like normal
@jreback a few thoughts...

> u can also bail out and just evaluate like normal

probably not a good idea to evaluate like normal since that would require the use of `eval`.

> well as long as df -> df.values and index1 and index2 were converted it could be ok

`numexpr` does use the array interface so df -> df.values is ok, but you cannot pass an expression with get/setitem calls. by converted i'm guessing you mean the indices are converted to numpy arrays, which they will be as long as they implement the array interface.

the issue of get/setitem calls remains and i will have to address it for this to be at all useful. to handle those calls the parser would need to special-case it, but only in the case of the numexpr engine; e.g., a future engine might be able to handle those calls without any need to special-case the parsing (e.g., blaze, numba, picloud, etc.). should i do that? if so then a set of `Parser` classes is in order i think, probably one for each engine, with a default implementation of get/setitem expression parsing...
actually, i think it might be simpler to start with `df.eval('df > 0', engine='numexpr')` than to start by messing with `Term` and friends, since i could just eval the expression with whatever engine and then pass to `df[the_engine_evald_expr]`
You can't use/rely on the array interface; rather you need to translate to a direct numpy expression. I do this in `expressions.evaluate` (even for the simple case of `df[df>0]`); the final expression before passing to the engine cannot have anything except a valid numpy expression (e.g. arrays, operators, and scalars)
i understand the translation aspect of expressions without indexing. am i being thick about the indexing ops though? i don't understand how that is going to work without a significant amount of parser infrastructure
e.g.,
from numpy.random import randn
import numexpr as ne
v = randn(10)
x = v[1]  # valid numpy expr
ne.evaluate('v[1]')  # not a valid numexpr expression (raises an error)
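One workaround (my sketch, not a settled design) is to perform the getitem in numpy first, so numexpr only ever sees plain arrays, operators, and scalars:

```python
import numpy as np
import numexpr as ne

v = np.random.randn(10)

# do the indexing in python/numpy first...
x = v[1:5]

# ...then numexpr sees only an array, an operator, and a scalar
res = ne.evaluate('x > 0')
assert (res == (x > 0)).all()
```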
Provide a top-level `eval` function, something like: pd.eval(function_or_string, method=None, **kwargs)
to support things like:
1) out-of-core computation (locally) (see #3202)
2) string evaluation, which avoids python's eager evaluation of sub-expressions (so pandas can process efficiently)
pd.eval('df + df2', method='numexpr')
(or maybe default to numexpr)
see also: http://stackoverflow.com/questions/16527491/python-perform-operation-in-string
3) possible out-of-pandas evaluation
pd.eval('df + df2', method='picloud')
http://www.picloud.com/ (though they seem to have really old versions of pandas, I think they handle it anyhow)