kieferk / dfply

dplyr-style piping operations for pandas dataframes
GNU General Public License v3.0
889 stars 103 forks

Use numpy and math functions inside verbs? #80

Open derekpowell opened 5 years ago

derekpowell commented 5 years ago

I'm running across errors when I try to use numpy or math functions (e.g., sqrt, log, etc) inside dfply verbs. Here's a minimal example:

import pandas as pd
from dfply import *
import numpy as np

df = pd.DataFrame({'x': np.linspace(1, 10, 500)})
df >> mutate(y = np.log(X.x))

This gives the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-f8d61ebf2e20> in <module>()
      3 df = pd.DataFrame({'x': np.linspace(1, 10, 500)})
      4 
----> 5 df >> mutate(y = np.log(X.x))

ValueError: invalid __array_struct__

Is this functionality not implemented? Maybe there's a workaround I'm not seeing?

(I'm on python 3.6.3)

sharpe5 commented 5 years ago

Hmmmm - try casting X.x to another array type that NumPy understands?

What type is X.x? Can you call type() on it or something similar?

derekpowell commented 5 years ago

Thanks for the quick response.

type(X.x) returns dfply.base.Intention, and type(df.x) returns pandas.core.series.Series (as expected).

In the example I gave, df.assign(y = np.log(df.x)) works fine. So I'm pretty sure it's not a problem with the array in the dataframe.

sharpe5 commented 5 years ago

I've run into this before. It will be a matter of experimenting with different type conversions until you get something that NumPy accepts.

I'm also not sure whether it's passing each cell into numpy one at a time, or the entire column at once. There must be some way to verify that.

derekpowell commented 5 years ago

I'm not an expert but I'm very confident it's not a problem with the original dataframe. Would be curious if the example I gave reproduces? Can also try an even simpler example:

df = pd.DataFrame({"x":[.1,.2,.3,.4,.5,1,2,3]})
df >> mutate(y = np.log(X.x))

That gives the same error for me. Hopefully @kieferk can solve this.

jankatins commented 5 years ago

The problem here is that Python isn't R. R has delayed evaluation, which means the call to the log function is postponed until the function receives the dataframe as context. Python doesn't have delayed evaluation, so the log transformation is applied to the X.x object first, and only then is the result passed to the mutate call. The X object usually simulates delayed evaluation by recording your intent (for mutate(z=X.x*X.y): "multiply the x column of the passed-in dataframe by the y column"). mutate receives this recording and executes it in the context of the real dataframe.

The problem arises when a function doesn't know about this mechanism, as with np.log here. It expects an array (which is why df.x works) but gets the "recorder" object instead.

What might work is X.x.log().
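
The recording behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the Recorder class and the mutate function below are hypothetical stand-ins written for this example, not dfply's actual Intention class or verb, and math.log stands in for np.log so that the snippet needs only the standard library.

```python
import math


class Recorder:
    """Toy deferred expression (hypothetical, not dfply's class).

    It stores a function that is run later, once the real data is known.
    """

    def __init__(self, run=lambda data: data):
        self._run = run

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        # Record "fetch column `name`" rather than fetching it now.
        return Recorder(lambda data: self._run(data)[name])

    def __mul__(self, other):
        # Record an element-wise multiplication of two recorded columns.
        return Recorder(
            lambda data: [a * b for a, b in zip(self._run(data), other._run(data))]
        )

    def evaluate(self, data):
        # Replay the recording against the real data.
        return self._run(data)


X = Recorder()


def mutate(data, **new_columns):
    """Toy verb: evaluate each recorded expression against `data`."""
    result = dict(data)
    for name, expr in new_columns.items():
        result[name] = expr.evaluate(data)
    return result


data = {"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]}

# Operators the recorder knows about are deferred until mutate() runs.
out = mutate(data, z=X.x * X.y)
print(out["z"])

# math.log knows nothing about Recorder, so Python calls it immediately,
# before mutate() ever sees the expression, and the call fails.
try:
    mutate(data, z=math.log(X.x))
    failed = False
except TypeError as err:
    failed = True
    print("eager call failed:", err)
```

The multiplication succeeds because the toy object defines what `*` means for a recording. The log call fails because the function is handed the recording itself rather than a column of numbers, which mirrors the error in the original report.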

derekpowell commented 5 years ago

Aha, this is what I feared. That's unfortunate, definitely limits the utility of the mutate() functions in dfply.

I've also been playing with the plydata package which can handle these kinds of operations. In plydata, computations are passed as strings, e.g. mutate(y = "np.log(x)"). This isn't necessarily more elegant but seems it's allowed them to make these kinds of operations work properly. Unfortunately, it's currently a bit less complete wrt the verbs available in the tidyverse (e.g., currently missing gather() and spread()) that dfply has covered very well.

germayneng commented 5 years ago

I think a workaround could be:

df >> mutate(y_log = np.log(df['y']))

omrihar commented 4 years ago

I came across this issue because I was searching for the exact same problem. After reading the documentation I noticed this is actually addressed, and the correct way to solve this would be:

@make_symbolic
def log(series):
    return np.log(series)

df >> mutate(y_log = log(X.y))

I can verify this works without a problem!
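
To show why a decorator of this kind helps, here is a rough sketch of the general idea. It is not dfply's implementation of make_symbolic: the Recorder class and the defer decorator are hypothetical names invented for this example, and math.log over a plain list stands in for np.log over a Series.

```python
import functools
import math


class Recorder:
    """Toy deferred expression (hypothetical, not dfply's Intention)."""

    def __init__(self, run=lambda data: data):
        self._run = run

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        # Record "fetch column `name`" rather than fetching it now.
        return Recorder(lambda data: self._run(data)[name])

    def evaluate(self, data):
        # Replay the recording against the real data.
        return self._run(data)


def defer(func):
    """Hypothetical decorator: postpone `func` if any argument is a Recorder."""

    @functools.wraps(func)
    def wrapper(*args):
        if not any(isinstance(a, Recorder) for a in args):
            return func(*args)  # plain arguments: call right away

        def run(data):
            # Resolve recorded arguments against the real data, then call func.
            resolved = [
                a.evaluate(data) if isinstance(a, Recorder) else a for a in args
            ]
            return func(*resolved)

        return Recorder(run)

    return wrapper


@defer
def log(column):
    return [math.log(value) for value in column]


X = Recorder()
data = {"y": [1.0, math.e]}

expr = log(X.y)               # nothing is computed yet; expr is a Recorder
result = expr.evaluate(data)  # the log is applied only now, to the real column
print(result)

direct = log([1.0])           # ordinary input is still computed immediately
print(direct)
```

The wrapped function never sees the recording itself. It is called only after the recorded arguments have been replaced by real data, which is the property the undecorated np.log call lacks.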

tonyduan commented 4 years ago

I've run into this issue as well and it'd be awesome if we could add @omrihar's solution into the codebase!