kieferk / dfply

dplyr-style piping operations for pandas dataframes
GNU General Public License v3.0
889 stars, 103 forks

Mutate with boolean expressions #71

Open jonvitale opened 5 years ago

jonvitale commented 5 years ago

Hi, thank you for all the good work here, I like this the best of the dplyr clones.

In R I am able to do something like: df %>% mutate(newcol = ifelse(x > 3 & lead(y) < 2, 'yes', 'no'))

In Python it seems that I should be using the numpy.where function. I also read enough of your documentation to realize I need to wrap this function in another function with the @make_symbolic decorator. So, I have this:

@make_symbolic
def np_where(bools, val_if_true, val_if_false):
    return list(np.where(bools, val_if_true, val_if_false))

When I call it like this, it works just fine:

df >>= mutate(my_val = np_where(lead(X.CPOS) == 'F', 'Punct', 'Not Punc'))

However, if I make the boolean expression more complex with ands or ors, I get an error:

df >>= mutate(my_val = np_where(lead(X.CPOS) == 'F' & X.CPOS == 'F', 'Punct', 'Not Punc'))

I also tried:

df >>= mutate(my_val = np_where(lead(X.CPOS) == 'F' and X.CPOS == 'F', 'Punct', 'Not Punc'))

I get this error: TypeError: __index__ returned non-int (type Intention)

I thought that my @make_symbolic decorator took care of this kind of thing. Perhaps I need a logical and that also has the delaying decorator.
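On the question of a "delayed" logical and: Python lets a class overload & (via __and__) but not the and keyword, which always asks the left operand for its truth value first. The toy class below is a rough sketch of that difference. Lazy is a hypothetical stand-in written for this example, not dfply's own class, and it raises in __bool__ on purpose so the misuse is visible; dfply's real behaviour with and may differ.

```python
class Lazy:
    """Hypothetical deferred expression; it only records what was asked for."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        # Comparison is overloadable: return a new deferred expression.
        return Lazy(f"({self.text} == {other!r})")

    def __and__(self, other):
        # `&` is overloadable too, so two deferred conditions can be combined.
        return Lazy(f"({self.text} & {other.text})")

    def __bool__(self):
        # `and` / `or` call this; there is no hook that would defer them.
        raise TypeError("a deferred expression has no truth value yet")


cpos = Lazy("cpos")
next_cpos = Lazy("lead(cpos)")

# Works: both comparisons and the `&` stay deferred.
combined = (next_cpos == "F") & (cpos == "F")
print(combined.text)

# Fails: `and` forces bool() on the left operand immediately.
try:
    (next_cpos == "F") and (cpos == "F")
    and_error = None
except TypeError as exc:
    and_error = str(exc)
print(and_error)
```

So a symbolic version of and would have to be an ordinary function or the & operator; a decorator cannot change how the and keyword itself is evaluated.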

grst commented 5 years ago

I believe this is part of a larger problem: standard Python functions that have not been specifically adapted for dfply do not work. Take, for example, joining multiple str columns together:

df >> mutate(new_col = "_".join([X.col1, X.col2, X.col3]))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-4cac91689aba> in <module>
----> 1 ap_raw >> mutate(cell_name = "_".join([X.patient, X.cellid]))

TypeError: sequence item 0: expected str instance, Intention found

jwdink commented 5 years ago

I think the problem is just operator precedence. In Python, & binds more tightly than ==, so lead(X.CPOS) == 'F' & X.CPOS == 'F' is parsed as lead(X.CPOS) == ('F' & X.CPOS) == 'F'. If you wrap each condition in parentheses, I believe it will work.
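One way to check the precedence point without dfply is to ask Python's own parser. This is only an illustrative sketch: lead and cpos are placeholder names that get parsed but never evaluated, and ast.unparse assumes Python 3.9 or later.

```python
import ast

# The condition from the question, without parentheses around the comparisons.
unparenthesized = "lead(cpos) == 'F' & cpos == 'F'"
tree = ast.parse(unparenthesized, mode="eval").body

# Python sees ONE chained comparison whose middle operand is `'F' & cpos`.
assert isinstance(tree, ast.Compare)
chained_operands = [ast.unparse(node) for node in [tree.left, *tree.comparators]]
print(chained_operands)

# With parentheses, the top-level node is a single `&` of two comparisons.
parenthesized = "(lead(cpos) == 'F') & (cpos == 'F')"
fixed_tree = ast.parse(parenthesized, mode="eval").body
assert isinstance(fixed_tree, ast.BinOp)
assert isinstance(fixed_tree.op, ast.BitAnd)
print(ast.unparse(fixed_tree.left), "|", ast.unparse(fixed_tree.right))
```

The first parse shows that the string 'F' ends up as an operand of &, which is not what the condition was meant to express.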

E.g. this works:

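import numpy as np
import pandas as pd
from dfply import *

# Wrap np.where so dfply delays the call until the X expressions are evaluated.
@make_symbolic
def np_where(bools, val_if_true, val_if_false):
    return np.where(bools, val_if_true, val_if_false)

df = pd.DataFrame({'cond1': [0, 1], 'cond2': [1, 0]})
# Parenthesize each comparison, because & binds more tightly than ==.
df >> mutate(result = np_where((X.cond1 == 1) & (X.cond2 == 1), 5, 2))

jwdink commented 5 years ago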

Also note that the package seems to have a built-in if_else function; see https://github.com/kieferk/dfply/blob/master/dfply/vector.py. It seems to use a list comprehension instead of np.where, though, so it could be slower than necessary.