kieferk / dfply

dplyr-style piping operations for pandas dataframes
GNU General Public License v3.0

Example request #59

Open hhoeflin opened 6 years ago

hhoeflin commented 6 years ago

Hi,

I am an avid dplyr user in R and somewhat new to Python. I had been looking for a dplyr-like package in Python for a while when I came across dfply, which looks pretty close to what I was looking for.

Please excuse me if this is not quite the right forum, but I am looking for some help, and possibly to request some documentation or a feature.

My use case is essentially that I have a function that operates on single elements of data frame columns, e.g.

`my_func(a, b)`

where both a and b are single elements from columns of a data frame. I found a Stack Overflow post that shows how to do this for an operation on a single column only:

https://stackoverflow.com/questions/42671168/dfply-mutating-string-column-typeerror

The solution shown there, using X.file.apply for the column X.file in the data frame, seems to work only when you have a single column to operate on.
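For context, the single-column pattern from that post looks roughly like this (the example data and the lambda are made up); the deferred `.apply` is tied to one column, which is why it does not extend directly to a two-argument `my_func(a, b)`:

```python
import pandas as pd
from dfply import X, mutate

df = pd.DataFrame({'file': ['report.txt', 'data.csv']})

# Deferred Series.apply on one column: X.file.apply(...) builds an
# Intention that is evaluated against the real column inside the pipe.
result = df >> mutate(stem=X.file.apply(lambda f: f.split('.')[0]))
```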

What I was essentially wondering is: how do you recommend using dfply in this context? Could you add some documentation on how best to use functions that don't natively understand Series objects?

E.g. could there be an "Intention"-like object that takes a function operating on several parameters, each of which is intended to be a single element from a column, "vectorizes" this function, and then, when passed an Intention object representing a Series, applies it appropriately?

Thanks for your help!

sharpe5 commented 6 years ago

You can use custom functions for this. I've tried it and it works well.

Pass the columns in as parameters to the function; they will arrive as vectors (Series). Adjust your custom function until it takes and returns vectors as well as scalars. If you run into problems finding the right API calls to operate on vectors, just loop over the inputs.

Remember to annotate your function with @make_symbolic; see the section "Extending dfply with custom functions" in the dfply README.

If you run into speed problems with huge multi-million-row dataframes, you can rewrite it to use straight pandas with Numba for JIT compiling.

I would provide a code sample, but I am not at my PC right now. Good luck!
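(For reference, a minimal sketch of the approach described above, assuming @make_symbolic is importable from the top-level dfply namespace; my_func and the column names are made up:)

```python
import pandas as pd
from dfply import X, make_symbolic, mutate

# Hypothetical scalar function that does not understand Series.
def my_func(a, b):
    return a * 2 + b

# Symbolic wrapper: receives whole columns and loops element-wise.
@make_symbolic
def my_func_vec(a, b):
    return pd.Series([my_func(x, y) for x, y in zip(a, b)], index=a.index)

df = pd.DataFrame({'u': [1, 2, 3], 'v': [10, 20, 30]})
result = df >> mutate(w=my_func_vec(X.u, X.v))
```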

sharpe5 commented 6 years ago

Numba will take a function designed for scalars and, after an annotation, produce a function that takes vectors. It JIT-compiles the function in the background to machine code using LLVM, so it runs at a speed comparable to C code.

To use this with dfply, you will need two functions: the outer one is annotated with @make_symbolic and calls the inner one, which has been vectorised with Numba.

To use this with plain pandas, you can call the function directly, which is somewhat faster. However, dfply is so much cleaner and easier to code with that it's not worth doing this unless profiling shows a speed bottleneck.
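(Again a sketch only, under the assumption that the scalar arithmetic inside the function is Numba-compatible; the function body and column names are illustrative:)

```python
import pandas as pd
from numba import vectorize, float64
from dfply import X, make_symbolic, mutate

# Inner function: written for scalars, JIT-compiled into a NumPy ufunc.
@vectorize([float64(float64, float64)])
def _my_func_numba(a, b):
    return a * 2.0 + b

# Outer function: symbolic wrapper so it can sit inside a dfply pipe.
@make_symbolic
def my_func(a, b):
    return pd.Series(_my_func_numba(a.values, b.values), index=a.index)

df = pd.DataFrame({'u': [1.0, 2.0], 'v': [10.0, 20.0]})
result = df >> mutate(w=my_func(X.u, X.v))
```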

hhoeflin commented 6 years ago

Thanks for this. I saw that this would work, but it is somewhat cumbersome in my mind compared to native dplyr in R. With this approach, much of the niceness and ease of flow of writing dplyr goes away.

Would there be a possibility of an explicit "Intention"-based object that does exactly this, i.e. a wrapper EF (for element_function) that wraps a function taking elements of a vector as arguments and automatically iterates over the Series elements?

It is just that any requirement to add "decorators" breaks the natural piping flow of dplyr, and having to prepare it all beforehand breaks the reading flow of the code.
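(To make the request concrete, a purely hypothetical sketch of what such an EF wrapper could look like; it is not part of dfply, and np.vectorize here is just a convenience loop, not a performance claim:)

```python
import numpy as np
from dfply import make_symbolic

# Hypothetical "EF" (element_function) wrapper: turns a scalar function
# into a symbolic, element-wise one, so no decorator is needed where
# the scalar function itself is defined.
def EF(scalar_func):
    @make_symbolic
    def wrapped(*columns):
        return np.vectorize(scalar_func)(*columns)
    return wrapped

# Usage inside a pipe (df, my_func, and the columns are illustrative):
# df >> mutate(w=EF(my_func)(X.u, X.v))
```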

sharpe5 commented 6 years ago

In my experience, 90% of dfply work does not need custom functions. If custom functions are required, they are quite fast to write: just copy, paste and modify another working function you already have.

In my opinion, dfply is just as clean and usable as dplyr.
