nalepae / pandarallel

A simple and efficient tool to parallelize Pandas operations on all available CPUs
https://nalepae.github.io/pandarallel
BSD 3-Clause "New" or "Revised" License

Error when using parallel_apply on groupby! #117

Open zahra-rou opened 3 years ago

zahra-rou commented 3 years ago

I have used parallel_apply on a groupby like this:

pandarallel.initialize(progress_bar=True, verbose=2)
shopping_history.groupby(['user_id', 'date', 'item_id']).parallel_apply(self.find_same_category_items)

It used to work with the original apply function, but that takes a long time, so I switched to parallel_apply to speed up the process. Now parallel_apply raises an error: TypeError: find_same_category_items() missing 1 required positional argument: 'df'

find_same_category_items() is a method of the same class as the method from which I call parallel_apply. This is its definition:

def find_same_category_items(self, df):
    # Some process....

I'm running this on Azure ML service.

SijanShrestha7 commented 3 years ago

Have you found a solution? I am facing the same error.

jonas-schulze commented 3 years ago

I'm not that savvy in Python, but maybe self.find_same_category_items is interpreted as a "function handle" instead of a "method handle" (not actually technical terms; Python's own names are a plain function and a bound method). A method call self.find_same_category_items(df) is implicitly converted to the function call find_same_category_items(self, df). Maybe pandarallel is trying to resolve your code as the latter without supplying self, so you could try the more explicit form:

...parallel_apply(lambda gdf: self.find_same_category_items(gdf))
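To illustrate the distinction in plain Python (no pandarallel involved), here is a minimal sketch. The Recommender class, its category attribute, and the list-based "group" are hypothetical stand-ins for the code in the original post. Whether pandarallel actually strips self from the bound method internally is an assumption, not something confirmed in this thread.

```python
class Recommender:
    """Hypothetical stand-in for the class in the original post."""

    def __init__(self, category):
        self.category = category

    def find_same_category_items(self, df):
        # Placeholder body: keep only entries matching this instance's category.
        return [row for row in df if row == self.category]


rec = Recommender("books")
group = ["books", "toys", "books"]  # stand-in for one groupby chunk

# A bound method already carries `self`, so one argument is enough.
bound = rec.find_same_category_items
print(bound(group))

# The underlying plain function has no `self` attached. Calling it with
# only the group binds the group to `self` and leaves `df` missing,
# which produces the same kind of TypeError as in the report.
plain = bound.__func__
try:
    plain(group)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# The lambda closes over `rec`, so `self` survives even if the caller
# only ever sees an ordinary one-argument function.
wrapped = lambda gdf: rec.find_same_category_items(gdf)
print(wrapped(group))
```

If something in the call chain ends up with the plain function rather than the bound method, the lambda wrapper sidesteps the problem because the instance travels inside the closure rather than as a separate argument.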

shermansiu commented 6 months ago

Are there still problems here? It seems like Jonas' solution resolves this issue.