pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: eval function #3393

Closed: jreback closed this 11 years ago

jreback commented 11 years ago

Provide a top-level eval function, something like:

pd.eval(function_or_string, method=None, **kwargs)

to support things like:

1) out-of-core computation (locally) (see #3202)

2) string evaluation that bypasses Python's eager evaluation of sub-expressions (so pandas can process the whole expression efficiently)

pd.eval('df + df2', method='numexpr') (or maybe default to numexpr)

see also: http://stackoverflow.com/questions/16527491/python-perform-operation-in-string

3) possibly out-of-pandas evaluation: pd.eval('df + df2', method='picloud') (http://www.picloud.com/) (though they seem to have really old versions of pandas, I think they handle it anyhow)
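
A minimal sketch of the kind of call being proposed, assuming numexpr is installed (the keyword later shipped as engine rather than method):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 3), columns=list('abc'))
df2 = df * 2

# The whole expression arrives as a string, so pandas can hand it to
# numexpr in one pass instead of letting Python eagerly evaluate each
# sub-expression.
result = pd.eval('df + df2', engine='numexpr')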

hayd commented 11 years ago

To follow on from your comment, shouldn't we be using & and |? I think this may also have the benefit of all and any just working.

Also ~ for not/invert (since that would make it the same as numpy).

I haven't got my head around numexpr yet, so I may be talking complete nonsense. (I've moved Term to expressions without breaking things, and changed the repr so it evals back to itself; was there a reason for it not to?)

jreback commented 11 years ago

I agree about the operators (though I think you actually need to accept both); these are always going to be in a string expression in any event, because you need delayed evaluation. but since we actually DO want the & etc., you can just replace them (this is really a user-interface issue); we are not actually going to evaluate them in Python

e.g.

df[(df > 0) and (df < 10)]

vs

df['(df > 0) and (df < 10)']

cpcloud commented 11 years ago

i'm sure everyone involved in this thread knows this but just wanted to point out that the precedence of and and & is different. if i was a first-time user i would think that df > 0 and df < 10 and df > 0 & df < 10 do the same thing. so if both are going to be supported, i think precedence rules should be kept as close to python as possible, meaning parens are required for & but not for and.
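
A small illustration of the precedence point, in plain pandas:

import pandas as pd

s = pd.Series([-1, 5, 20])

# `and` needs a single truth value, so `s > 0 and s < 10` raises.
# `&` binds tighter than `>` and `<`, so `s > 0 & s < 10` parses as the
# chained comparison `s > (0 & s) < 10`, which also raises.
# Only the parenthesized form does what a first-time user expects:
print(s[(s > 0) & (s < 10)])  # keeps only the value 5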

jreback commented 11 years ago

@cpcloud this is in a string context, so in theory you can make them the same. this is a big confusion in operating in pandas, I think; people expect them to be the same (even though they are wrong)

cpcloud commented 11 years ago

@jreback sure. i was just semi-thinking-out-loud here, and thought that it might warrant a discussion. this goes back to the python core devs not wanting to provide the ability to overload not, and, and or, so numpy was forced to overload the bitwise operators for boolean operations (there's a youtube video of a discussion about this with GVR; there's even a patch to core python that allows you to do this). i really wish that pep had gone through, sigh. i didn't realize there was big confusion here, since this really has nothing to do with pandas; it's a language feature/bug. i was just thinking that adding more parsing rules to remember is annoying to users.

jreback commented 11 years ago

it's a valid point

the purpose of eval is to facilitate multi-expression parsing that we will evaluate in numexpr, so we have to have a string input (to avoid python evaluation of the sub-expressions). or maybe there is a way to disable this (like how %timeit works in ipython), but i think they r using an input hook and hence everything really is a string

cpcloud commented 11 years ago

@jreback u can do it with the cmd module too. i think ipython used to use that, or maybe they still do. i think only macros would allow you to do this without string literals. btw there is now a Python macros library. i haven't tried it out but it looks like fun. another possibility is to support numba as a method, although first things first (numexpr). do u already have something going for this?

jreback commented 11 years ago

@hayd said he was giving it a stab. Andy, can u post a link to your branch?

jreback commented 11 years ago

@cpcloud numba is interesting, but the infrastructure requirement is high, and in any event, it's basically using numexpr under the hood :) (as well as ctable for storage)

jreback commented 11 years ago

@cpcloud I reread your question

the issue is this: df[(df>0) & (df<10)] is evaluated as 3 separate sub-expressions, plus a boolean mask selection

while

df.eval('(df>0) & (df<10)') can be evaluated (after alignment) in a single numexpr pass (and then a boolean mask) to return the dataframe, so it can be a massive speedup

that's the main reason for this function
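
A rough illustration of the single-pass idea, using numexpr directly on a raw numpy array (pandas adds the alignment and mask selection on top):

import numexpr as ne
import numpy as np

values = np.random.randn(1000000)

# numpy materializes three full-size temporaries (values > 0, values < 10,
# and their &) before the mask is used...
mask_np = (values > 0) & (values < 10)

# ...while numexpr compiles the whole expression and evaluates it in one
# pass, without the intermediate boolean arrays.
mask_ne = ne.evaluate('(values > 0) & (values < 10)')

assert (mask_np == mask_ne).all()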

cpcloud commented 11 years ago

@jreback that is pretty cool. i haven't done much with numexpr, i assumed that pandas uses it when it can...is that a fallacious assumption? should i be explicitly using numexpr?

jreback commented 11 years ago

it's used in pytables for query processing and in most evaluations now as of 0.11 (you need a fairly big frame for it to matter)

see the core/expressions module

hayd commented 11 years ago

I haven't done much so far: I've moved Term to expressions and added some helper functions for that class, nor have I really looked into numexpr yet.

I kind of lost my way on the road map... and may be totally confused atm.

Am I way off here?

1. move Term to expressions
2. create class for "termset" (not sure what name; I was thinking this would be a list (possibly of termsets) with a flag for whether it was all/any)
3. work out how to process termset strings with numexpr (is this the tricky part?)
4. create method for "termset" to strings which can be processed by numexpr
5. create parser for our DSL to termset, e.g. '(df>0) & (df<10)' -> [Term(df, '>', 0), Term(df, '<', 10)]

jreback commented 11 years ago

so there are 3 goals here:

1) parser to turn:

 'df[(df>0)&(df<10)]'

into this (call this the parsed_expr)

df[_And(Term('df','>',0), Term('df','<',10))]

2) take a parsed_expr and align the variables (e.g. perform stuff like what combine_frame, combine_series, combine_scalar do: the alignment/reindexing steps); call this the aligned_expr

3) take the aligned_expr and turn it into a numexpr expression (like what Term and the expressions module do, though it's very simple); this would be an expansion of expressions to take in aligned Terms with their boolean operators (e.g. _And/_Or/_Not and parens)

1) involves tokenizing/ast manipulation (kind of like numexpr.evaluate does) to form the Terms; I am not sure how tricky this is, so we were going to skip it for now

2) is straightforward: take the parsed_expr and substitute variables that are aligned (keep frames as frames); you don't need to expand scalars at all, mainly just reindex the things that need it, creating the aligned_expr

3) is straightforward too: just take the term expressions and generate the numexpr itself

so I think termset is really Term, plus the boolean operators, and a grouping operator (the parens); these just allow easy expression manip (your 2). your 3: skip for now, that's my 1

your 4 is my 3

I don't think you need 5
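
To make the shape of this concrete, a hypothetical sketch of the parsed-expression containers and the numexpr-string generation (names follow the thread; the implementation is illustrative, not what landed in pandas):

class Term:
    """One comparison, e.g. Term('df', '>', 0)."""
    def __init__(self, lhs, op, rhs):
        self.lhs, self.op, self.rhs = lhs, op, rhs

    def to_numexpr(self):
        return f'({self.lhs} {self.op} {self.rhs})'

class _And:
    """Conjunction of Terms or nested _And/_Or groupings."""
    def __init__(self, *operands):
        self.operands = operands

    def to_numexpr(self):
        return '(' + ' & '.join(o.to_numexpr() for o in self.operands) + ')'

# '(df>0)&(df<10)' parses to the nested form, which generates the string:
parsed = _And(Term('df', '>', 0), Term('df', '<', 10))
print(parsed.to_numexpr())  # ((df > 0) & (df < 10))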

cpcloud commented 11 years ago

@jreback i know u said skip 1 but i can do that if u want (lots of nice python builtins for dealing with python source) while @hayd does 2 and 3. what would be allowed to be parsed? exprs in the python grammar? or just boolean exprs? could start with booleans for now and extend after that is working...

jreback commented 11 years ago

the more the merrier!

let's start with the example

df.eval('(df>0)&(df<10)')

This is really about the masks as that's where all the work is done

but I think it would be nice eventually to do something on the rhs as well:

pd.eval('df[(df>0)&(df<10)] = df * 2 + 1', engine='numexpr')

so we can support getitem and setitem and pass both the lhs and rhs to the evaluator

(imagine engine='multi-process' or 'out-of-core')...

to heck with blaze! (well, maybe engine=blaze is ok too)
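
For reference, the column-assignment flavor of this did eventually land as DataFrame.eval (narrower than the full getitem/setitem form imagined above):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 2), columns=['a', 'b'])

# The rhs is evaluated by the engine (numexpr when available) and bound
# to a new column named on the lhs.
out = df.eval('c = a + b')
print(out.columns.tolist())  # ['a', 'b', 'c']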

hayd commented 11 years ago

I think I was worried that nested Terms wouldn't come for free with _And and _Or, but I'll put something together imminently and we can see whether it does. :)

hayd commented 11 years ago

We can just tell everyone it's blaze...

cpcloud commented 11 years ago

i've got it parsing nested and terms already :)

cpcloud commented 11 years ago

albeit they are strings right now, and only & (parsing and is different); i haven't written the _And class yet

jreback commented 11 years ago

@cpcloud I would just use the &, |, and ~ for now (to keep consistent), can always add later

jreback commented 11 years ago

@cpcloud

the end goal is to create a numexpr expression (the functionality is in the Selection class in io/pytables.py); so the class that holds the parsed expression (the nested _And/_Or) should parse to this (and has to do type translation and such). also this class could do the alignment I think (which is the reason for having the parsed expression: you can basically just iterate thru all of the terms and see what needs to be aligned)

e.g.

for t in term_expression:
    t.align()

Term.align (pseudo-codish):

from pandas import DataFrame, Series

def align(self):
    # self.lhs, self.op, self.rhs hold the pieces of one Term
    if isinstance(self.lhs, DataFrame):
        if isinstance(self.rhs, Series):
            ...  # reindex/broadcast the Series to the frame
        elif isinstance(self.rhs, DataFrame):
            ...  # align on index and columns
        else:
            ...  # scalar: pass through untouched

maybe return a new expression that is aligned

cpcloud commented 11 years ago

ah i see. so an Expr class should hold the ands and ors which consist of terms (or nested expressions). Expr could have an align method which aligns and then passes to numexpr. is that correct?

jreback commented 11 years ago

I think you actually need 3 classes here:

1) Term, which holds lhs operator rhs (and prob a reference back to the top-level Expr for variable inference)

2) Termset, although maybe Expr or maybe Terms is better here (I mean a nested list of _and/_or/_not operators on the Terms)

3) Top-level, maybe Expression, which holds 2) the termset, and the engine and such

e.g.

pd.eval('df[(df>0)&(df<10)]')

yields

Expression():
    original string
    df[mask] (you need to keep this 'where')
    termset of the boolean expression
    engine
    maybe an environment pointer (this is like a closure) but we are not fancy here :)

    methods:
        parse (create the termset)
        align (have the termset align)
        convert_to_engine_format (return the converted termset)

Termset():
    _and(Term('df','>',0), Term('df','<',10))
    methods:
        align (maybe return a new termset that is aligned)
        convert_to_engine_format (return the converted-to-engine format;
            this would be a string)

cpcloud commented 11 years ago

lol gh doesn't like ur rst flavored monospace

hayd commented 11 years ago

This was where I was up to: https://github.com/hayd/pandas/tree/term-refactor

cpcloud commented 11 years ago

possible engines right now are 'numexpr' and 'pytables'?

jreback commented 11 years ago

well....the pytables target is the same, numexpr; the only difference is that the Terms need to do different alignment, as they are scalar-type conditions, e.g. index>20130523, where index is a field in the table and the date gets translated to i8. so we do need support for that (and yes, you could use engine=pytables to handle it), but in pytables you need to have what I call the queryables dict passed in anyhow for validation (whereas in the case of a boolean expression you have the df passed in, or taken from the locals())
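
A short sketch of that pytables-engine shape as exposed through HDFStore.select; the store name data.h5 and key 'df' are hypothetical, and the frame is assumed stored in table format with a datetime index:

import pandas as pd

# The where string is parsed into Terms, validated against the table's
# queryables, and the datetime literal is translated to its int64 (i8)
# form before numexpr runs inside PyTables.
with pd.HDFStore('data.h5') as store:
    result = store.select('df', where='index > "2013-05-23"')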

cpcloud commented 11 years ago

@jreback @hayd fyi for some reason expressions.py has dos line endings while, for example, frame.py does not. isn't git supposed to take care of this? it's pretty annoying and will cause a billion and one merge conflicts...it's just that file: i just ran dos2unix on all of pandas and that's the only thing that changed. i did this after a fresh clone

jreback commented 11 years ago

that was my fault: I had the wrong setting on git, which used the Windows line endings (I changed my git so now it won't change them). I actually edit using xemacs on a pc even though I do everything on Linux, so u could change them if u want, no big deal

cpcloud commented 11 years ago

oh ok cool thanks

cpcloud commented 11 years ago

@jreback by align here you're referring to something slightly different from the align methods on frames and series, right? for example, the expression df > 0 will result in a Term('df', '>', [0]) object. aligning a list to a df could be done by conversion to Series, but then alignment will result in the original df and a Series with a bunch of NaNs except where the non-nan values were in the original list

jreback commented 11 years ago

the entire expression will be passed as numpy values. when I say alignment, I mean anything that needs reindexing or some sort of conversion needs to be done

eg see the methods combine_frame, combine_series, combine_scalar

so in your example nothing needs to be aligned (as a scalar can be passed directly)

step thru what happens when you do

df1 + df2 on frames that overlap (but are not identical)

see what is passed to expressions.evaluate

that's the end result, so alignment needs to make the shapes match up (but, for example, scalars don't need any treatment)
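
A tiny worked example of that alignment step, using DataFrame.align directly:

import pandas as pd

# Two frames that overlap but are not identical: before their raw numpy
# values can be handed to an engine, they must share a shape.
df1 = pd.DataFrame({'a': [1.0, 2.0]}, index=[0, 1])
df2 = pd.DataFrame({'a': [10.0, 20.0]}, index=[1, 2])

left, right = df1.align(df2)         # union index [0, 1, 2], NaN where missing
result = left.values + right.values  # shapes match; pure numpy from here on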

cpcloud commented 11 years ago

ah i c. totally clear now thanks

cpcloud commented 11 years ago

i don't know why i didn't think of this before but align will work here.

jreback commented 11 years ago

for sure

BUT you will need to apply it to the entire expression (it's meant for a binary argument now: self and other)

but I don't think u need it directly; really you should just use the combine_* functions as a guide

cpcloud commented 11 years ago

Right now align works recursively over binary operators and their operands until a Term is reached. I'm guessing that align might be inefficient in space for large frames there. Need to check.

jreback commented 11 years ago

ok, remember there is only going to be a small number of ops/terms (compared to the size of frames, in general)

cpcloud commented 11 years ago

Ah true. Very long expressions will be rare.

cpcloud commented 11 years ago

hm, how to deal with calls to __getitem__ here? numexpr doesn't like these. i've got it evaling expressions without getitem calls; sadly i had to pretty much break all of the existing pytables functionality (the current Term class is not general enough)

jreback commented 11 years ago

you have to translate everything to numpy values; numexpr shouldn't see anything except numpy arrays, operators, and scalars

cpcloud commented 11 years ago

numexpr doesn't support indexing and it won't ever. i suppose i could do the slicing at the last minute and special-case the parser for that

jreback commented 11 years ago

that's the point of the alignment step: you need to create a string expression with only numpy arrays, operators, and scalars. broadcasting is ok (eg let numpy do it), but nothing else. you may need to do dtype conversions (maybe)

cpcloud commented 11 years ago

ok. then for now while i'm working on this i'll ignore things like df[index1][index2] or attribute indexers and such.

jreback commented 11 years ago

well, as long as df -> df.values and index1 and index2 were converted, it could be ok. in any event we just want to start with a subset of operations. u can also bail out and just evaluate like normal

cpcloud commented 11 years ago

@jreback a few thoughts...

u can also bail out and just evaluate like normal

probably not a good idea to evaluate like normal since that would require the use of eval.

well as long as df -> df.values and index1 and index2 were converted it could be ok

numexpr does use the array interface, so df -> df.values is ok, but you cannot pass an expression with get/setitem calls. by converted i'm guessing you mean the indices are converted to numpy arrays, which they will be as long as they implement the array interface.

the issue of get/setitem calls remains, and i will have to address it for this to be at all useful. to handle those calls the parser would need to special-case them, but only in the case of the numexpr engine; a future engine might be able to handle those calls without any need to special-case the parsing (e.g., blaze, numba, picloud, etc.). should i do that? if so, then a set of Parser classes is in order, i think: probably one for each engine, with a default implementation of get/setitem expression parsing...
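
A hypothetical sketch of that per-engine Parser idea (all names illustrative, not pandas API):

import ast

class BaseExprParser:
    """Default parsing rules shared by all engines."""
    def visit_subscript(self, node: ast.Subscript):
        # By default, assume the engine can cope with get/setitem.
        return node

class NumexprParser(BaseExprParser):
    def visit_subscript(self, node: ast.Subscript):
        # numexpr cannot index, so the indexing must be rewritten to run
        # outside the numexpr pass (or the expression rejected outright).
        raise NotImplementedError('numexpr cannot evaluate __getitem__')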

cpcloud commented 11 years ago

actually, i think it might be simpler to start with df.eval('df > 0', engine='numexpr') than to start with messing with Term and friends, since i could just eval the expression with whatever engine and then pass to df[the_engine_evald_expr]
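
That simpler starting point, roughly, using pd.eval as it exists today (numexpr installed):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 2), columns=['a', 'b'])

mask = pd.eval('df > 0', engine='numexpr')  # engine does the comparison
subset = df[mask]                           # indexing stays in plain pandas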

jreback commented 11 years ago

You can't use/rely on the array interface; rather, you need to translate to a direct numpy expression. I do this in expressions.evaluate (even for the simple case of df[df>0]); the final expression, before passing to the engine, cannot have anything except a valid numpy expression (e.g. arrays, operators, and scalars)

cpcloud commented 11 years ago

i understand the translation aspect of expressions without indexing. am i being thick about the indexing ops, though? i don't understand how that is going to work without a significant amount of parser infrastructure

cpcloud commented 11 years ago

e.g.,

import numexpr as ne
import numpy as np

v = np.random.randn(10)
x = v[1]             # valid numpy expression
ne.evaluate('v[1]')  # raises: indexing is not a valid numexpr expression