jmcarpenter2 / swifter

A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner
MIT License

swifter apply on groupby objects #48

Closed Vozf closed 2 years ago

Vozf commented 5 years ago

It would be nice to be able to apply on groupby objects. Currently an error is raised:

Cannot access attribute 'swifter' of 'DataFrameGroupBy'
jmcarpenter2 commented 5 years ago

Hi @Vozf,

Thanks for raising this feature request.

As mentioned in my other response, I'll be spending some time this Saturday working on swifter.

I previously had implemented a groupby apply functionality, but unfortunately it was slower than pandas. Another user, @guyskk, suggested a different approach for increasing speed of groupby in #26. I will see if implementing that approach works as effectively as I hope it will.

Thanks, Jason

samuelefiorini commented 5 years ago

Hi @jmcarpenter2, any news on using swifter on SeriesGroupBy?

jmcarpenter2 commented 5 years ago

@samuelefiorini , thanks for the reminder!

I apologize for the delay. I've been kept busy lately and haven't had a ton of time to allocate to new features for swifter. I'll allocate some time for this in the near future.

Thanks, Jason

Juanouo commented 4 years ago

Hey @jmcarpenter2! Sorry to annoy you, but were you able to allocate some time? I just discovered swifter and was happy to accelerate my apply, but it was on a groupby object, so I wasn't able to try it.

jmcarpenter2 commented 4 years ago

Hi @Juanouo,

It's no bother at all. I have been working on the groupby apply but have run into some issues during implementation. I have some time this week, so I'll provide an update of my progress by the weekend.

Thanks, Jason

jmcarpenter2 commented 4 years ago

Limited progress so far.... It turns out this is a very non-trivial problem to solve as simply using dask is slower than pandas in most cases. And even trying to get fancy with how I use dask/pandas hasn't been very effective. I will keep plugging away to see if I can figure this out as I recognize that this is a high priority for users of swifter.

Juanouo commented 4 years ago

Thanks a lot Jason, I totally appreciate the effort, and take all the time that's needed

sushmit86 commented 4 years ago

I was also trying something similar with dask: https://stackoverflow.com/questions/59759521/use-dask-to-calculate-moving-average/59761508?noredirect=1#comment105721236_59761508

I do not see any significant improvements using dask.

LukasHaas commented 4 years ago

Is there any update on this?

abzhaobo commented 4 years ago

I'm also quite interested in this, as groupby + apply is quite common. Sometimes it would be annoying to ponder on how to vectorize or how to use numpy/cython/numba while pandas provides quite straightforward solutions.

ghazanfarj commented 4 years ago

excited about swifter groupby apply whenever it's here.

kmedved commented 4 years ago

Also excited about this functionality. Nonvectorized groupby apply operations are frustratingly common unfortunately.

RonakAgrawal77 commented 4 years ago

Thank you so much for contributing. Excited

tomerher commented 4 years ago

Hi, was there a solve for the groupby+apply issue? This seems really useful. Thanks!

diditforlulz273 commented 3 years ago

I'd love to see it too!

diditforlulz273 commented 3 years ago

@jmcarpenter2 by the way, I succeeded in parallelizing groupby-apply manually using only Ray. It works even with mp.pool, but Ray is around 15% faster because it moves data between processes more efficiently. The idea is to first split your dataframe into chunks based on one or all of the columns you want to perform .groupby on, and then feed the chunks to a number of Ray workers, each running a standard single-threaded pandas .groupby.apply. Use pd.concat afterwards.

Another trick is to set the internal data of the dataframe that Ray hands to each worker back to mutable, to avoid an unnecessary .copy() in and out of func. While .copy() is practically free in terms of computation time, it ruins everything with a 2x memory overhead, which you can't afford when working with large amounts of data.

Running it with 4 threads cuts the time by a factor of 2.5; with 12 threads, by a factor of 5. Here I'm calculating .ewm on a grouped series:

df is the dataframe we want to process; item_id and store_id are the columns to group by.

https://gist.github.com/diditforlulz273/06ffa5f5b1c00830671ce0330851352f
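
For readers who want the shape of the approach described above without opening the gist, here is a minimal sketch. It is not the gist's code: it swaps Ray for a standard-library thread pool so that it stays self-contained, and the helper names `split_by_group` and `parallel_groupby_apply`, the `sales` column, and the sample data are all made up for illustration. Threads will not deliver the speedups quoted above for pure-Python functions; Ray workers or a process pool would take their place in practice.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def split_by_group(df, key, n_chunks):
    """Split df into at most n_chunks pieces without splitting any group."""
    keys = list(df[key].unique())
    n_chunks = max(1, min(n_chunks, len(keys)))
    # Deal the distinct key values out round-robin, one bucket per chunk.
    buckets = [keys[i::n_chunks] for i in range(n_chunks)]
    return [df[df[key].isin(bucket)] for bucket in buckets]


def parallel_groupby_apply(df, by, column, func, n_workers=4):
    """Chunk on the first groupby column, apply per chunk, then concatenate."""
    chunks = split_by_group(df, by[0], n_workers)

    def work(chunk):
        # Plain single-threaded pandas groupby.apply on one chunk.
        return chunk.groupby(by)[column].apply(func)

    # Stand-in for the Ray workers mentioned in the comment (assumption).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(work, chunks))
    return pd.concat(parts).sort_index()


# Hypothetical sample data using the column names from the comment.
df = pd.DataFrame(
    {
        "item_id": [1, 1, 2, 2, 3, 3, 4, 4],
        "store_id": [10, 10, 10, 20, 20, 20, 30, 30],
        "sales": [5, 7, 1, 9, 4, 4, 8, 2],
    }
)


def value_range(s):
    # Example per-group function: spread between largest and smallest value.
    return s.max() - s.min()


by = ["item_id", "store_id"]
result = parallel_groupby_apply(df, by, "sales", value_range, n_workers=2)
expected = df.groupby(by)["sales"].apply(value_range)
```

Because every group lands in exactly one chunk, `result` matches the ordinary single-process `expected`. Chunking on the first groupby column alone is enough to keep groups intact, since rows that share all keys necessarily share the first one.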

babameme commented 3 years ago

Hi @jmcarpenter2, is there any update on swifter groupby followed by an apply function? I also need groupby -> rolling -> apply. Thanks

MikiGrit commented 3 years ago

There is already a library that does groupby -> apply parallelization (https://github.com/nalepae/pandarallel/). But I would definitely prefer to use just one parallelization library for all pandas cases rather than a bunch of them. Any progress recently?

Thanks for your work already btw! :+1:

samuelefiorini commented 3 years ago

Totally agree with you @MikiGrit. Pandarallel is cool but it doesn't support Windows (outside WSL).

quancore commented 3 years ago

Any progress?

jmcarpenter2 commented 2 years ago

Hey everyone, thanks for the interest in a swifter groupby apply!!

I want to update the group that I have tried many different approaches (including the ray approach listed above, as well as literally every approach mentioned in this stack overflow post), and across multiple test cases I have not been able to find a single solution that provides actual performance gain over a simple pandas groupby.apply when a user provides an arbitrary function or lambda.

I am 1000000% aspiring to provide this functionality, but I would hate to put something out there that not only doesn't speed up performance, but actually slows down groupby applies for users (while giving the impression that it could speed them up)

Stay tuned as I am NOT GIVING UP! But I just wanted to update everyone because I know this has been a long-awaited feature that we unfortunately are still waiting on

THANK YOU ALL FOR YOUR PATIENCE!!!

jmcarpenter2 commented 2 years ago

Hi all!

TLDR: Groupby-Apply is now available in swifter[groupby]==1.3.2

After the previous post, I figured I had to go back and try @diditforlulz273's solution just one more time to see if I could get some of the purported performance benefit. As it turns out, I had missed a key piece of the solution during my own implementation which ultimately was messing with my results.

Once I fixed the bug in my code, I started seeing sizable performance improvements, as you can see in the following image:

[image: groupby_parallel_v_single_real]

I am so thankful to @diditforlulz273 for providing a clear gist that I could implement, and also to everyone here for continuing to encourage and express interest in this feature of swifter. I am so pleased to be able to provide this new capability for users of swifter. Please let me know if you run into issues using it. And once again, thank you all for your patience over the past THREE YEARS!!!

Installation:

$ pip install -U swifter[groupby]
or
$ pip install -U swifter[groupby]==1.3.2

Usage:

df.swifter.groupby("group").apply(func)
or
df.swifter.groupby(["group_1", "group_2"]).apply(func)

CC: @Vozf , @samuelefiorini, @Juanouo, @sushmit86, @LukasHaas, @abzhaobo, @ghazanfarj, @kmedved, @RonakAgrawal77, @tomerher, @babameme, @MikiGrit, @quancore, and anyone else who is quietly "watching" this thread :)

diditforlulz273 commented 2 years ago

@jmcarpenter2 I wrote this code quite a while ago, and now I can say you should be wary of making data mutable the brute-force way I did in my gist, namely:

    # Ray makes data immutable when stored in its memory.
    # This approach prevents state sharing among processes, but we have a separate chunk for each process
    # to get rid of copying data, we make it mutable in-place again by this hack
    for d in range(len(df._data.blocks)):
        try:
            df._data.blocks[d].values.flags.writeable = True
        except Exception:
            pass

It might crash in very rare cases. But might not :) Just watch out, if people post issues about it - turn it off, as far as I remember this wouldn't affect speed, only memory footprint.
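
To make the warning concrete, the sketch below uses plain NumPy (no Ray, no pandas internals) to show what the `writeable` flag controls and why the hack is wrapped in `try/except`. The array contents are made up, and the read-only buffer from `np.frombuffer` merely stands in for memory the array does not own; whether Ray's shared memory behaves the same way is an assumption.

```python
import numpy as np

# An array that owns its memory: the flag can be toggled freely.
owned = np.arange(5)
owned.flags.writeable = False
try:
    owned[0] = 99
    write_blocked = False
except ValueError:
    # Writing to a read-only array raises instead of silently copying.
    write_blocked = True
owned.flags.writeable = True
owned[0] = 99  # works again once the flag is restored

# An array backed by an immutable bytes object: NumPy refuses to
# mark it writeable, so the assignment to the flag itself raises.
borrowed = np.frombuffer(b"\x01\x02\x03\x04", dtype=np.uint8)
try:
    borrowed.flags.writeable = True
    flag_flipped = True
except ValueError:
    flag_flipped = False

# The safe alternative costs extra memory: copy, then mutate the copy.
safe = borrowed.copy()
safe[0] = 42
```

The copy at the end is the memory overhead the hack was trying to avoid; dropping the hack trades that memory back for safety.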

jmcarpenter2 commented 2 years ago

Thanks for the heads up, I'll be sure to remove it if people raise any issues related to crashing :) unless you think I should just straight up remove it regardless?

yudhiesh commented 2 years ago

@jmcarpenter2 I am facing an issue where I can't filter out certain columns when running a groupby and apply:

(df
 .swifter
 .groupby('id')[list_of_columns]
 .apply(func)
)

I get the following error: TypeError: 'GroupBy' object is not subscriptable. Is this a bug or to be expected?

jmcarpenter2 commented 2 years ago

Ooooh good call @yudhiesh, this is a common use-case for groupby applies. I will take a look into this. I believe enabling this type of functionality is possible.

jmcarpenter2 commented 2 years ago

Hey @yudhiesh , following up here that this df.swifter.groupby(by)[key].apply(func) functionality will be added in v1.3.4, to be released later today once the CI/CD completes for PR #199

yudhiesh commented 2 years ago

@jmcarpenter2 thank you for the swift response and PR!

firmai commented 2 years ago

Thanks for this, I apply a rolling functionality afterwards and obtain the following: AttributeError: 'Rolling' object has no attribute 'agg'

df.swifter.groupby(df[ticker])[columns].rolling(i).agg(functions)

AlexanderTrg commented 1 year ago

Hey @jmcarpenter2, df.swifter.groupby(by)[key].apply(func) doesn't work; it fails with this error:

    File "/home/++++/.local/lib/python3.6/site-packages/swifter/swifter.py", line 623, in _ray_groupby_apply_chunk
        grpby = chunk.groupby(by, axis=self._axis, **self._grpby_kwargs)
    TypeError: groupby() got an unexpected keyword argument 'dropna'
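
A possible explanation, offered as an assumption rather than a confirmed diagnosis: the `dropna` keyword of `DataFrame.groupby` was added in pandas 1.1.0, so an older pandas in that Python 3.6 environment would reject it. A quick check with the standard library's `inspect` module could look like this (the helper name `groupby_accepts` is made up for illustration):

```python
import inspect

import pandas as pd


def groupby_accepts(keyword):
    """Return True if the installed pandas DataFrame.groupby takes `keyword`."""
    params = inspect.signature(pd.DataFrame.groupby).parameters
    return keyword in params


print("pandas version:", pd.__version__)
print("groupby accepts dropna:", groupby_accepts("dropna"))
```

If the second line prints `False`, upgrading pandas in that environment would be the first thing to try.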

zimmerling commented 1 year ago

Hey @jmcarpenter2, I get a TypeError when I run this code:

df.swifter.groupby(pd.Grouper(key='start_date', freq='1H')).apply(foo)

    File ~/++++/.venv/lib/python3.10/site-packages/swifter/swifter.py:592, in GroupBy._get_chunks(self)
        591 def _get_chunks(self):
    --> 592     subset_df = self._obj_pd.index if self._grpby_index else self._obj_pd[self._by[0]]
        593     unique_groups = subset_df.unique()
        594     n_splits = min(len(unique_groups), self._npartitions)

    TypeError: 'TimeGrouper' object is not subscriptable

jmcarpenter2 commented 1 year ago

Hey @Zimmerling, thanks for flagging this. I need to expand support for pd.Grouper objects. Currently, the groupby only supports a by argument that refers to columns or the index. But the use case is very clear from your example, thanks for showing that. I will work on this soon.
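
Until `pd.Grouper` is supported, one workaround that fits the column-or-index restriction described above is to materialise the hourly bucket as an ordinary column and group on that. The sketch below shows only the pandas side. `start_hour`, `value`, the sample timestamps, and the body of `foo` are illustrative, and passing the new column name to `df.swifter.groupby(...)` is an untested assumption.

```python
import pandas as pd

# Hypothetical data with the start_date column from the report above.
df = pd.DataFrame(
    {
        "start_date": pd.to_datetime(
            [
                "2023-01-01 00:05",
                "2023-01-01 00:40",
                "2023-01-01 01:10",
                "2023-01-01 01:55",
                "2023-01-01 02:30",
            ]
        ),
        "value": [1, 2, 3, 4, 5],
    }
)


def foo(values):
    # Placeholder for the user's function; here it sums one hour's values.
    return values.sum()


# Truncate each timestamp to the start of its hour so it can act as a plain key.
df["start_hour"] = df["start_date"].dt.floor("h")

hourly = df.groupby("start_hour")["value"].apply(foo)
```

Hours with no rows simply do not appear as groups here, which may differ from what `pd.Grouper(freq=...)` returns.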