Closed Vozf closed 2 years ago
Hi @Vozf,
Thanks for raising this feature request.
As mentioned in my other response, I'll be spending some time this Saturday working on swifter.
I had previously implemented groupby apply functionality, but unfortunately it was slower than pandas. Another user, @guyskk, suggested a different approach for speeding up groupby in #26. I will see if implementing that approach works as effectively as I hope it will.
Thanks, Jason
Hi @jmcarpenter2, any news on using swifter on SeriesGroupBy?
@samuelefiorini , thanks for the reminder!
I apologize for the delay. I've been kept busy lately and haven't had a ton of time to allocate to new features for swifter. I'll allocate some time for this in the near future.
Thanks, Jason
Hey @jmcarpenter2! Sorry to annoy you, but were you able to allocate some time? I just discovered swifter and was happy to accelerate my apply, but it was on a groupby object, so I wasn't able to try it.
Hi @Juanouo,
It's no bother at all. I have been working on the groupby apply but have run into some issues during implementation. I have some time this week, so I'll provide an update of my progress by the weekend.
Thanks, Jason
Limited progress so far. It turns out this is a very non-trivial problem to solve, as simply using dask is slower than pandas in most cases. Even trying to get fancy with how I use dask/pandas hasn't been very effective. I will keep plugging away to see if I can figure this out, as I recognize that this is a high priority for users of swifter.
Thanks a lot Jason, I totally appreciate the effort, and take all the time that's needed
I was also kind of trying some similar thing with dask https://stackoverflow.com/questions/59759521/use-dask-to-calculate-moving-average/59761508?noredirect=1#comment105721236_59761508
I do not see any significant improvements using dask.
Is there any update on this?
I'm also quite interested in this, as groupby + apply is quite common. Sometimes it would be annoying to ponder on how to vectorize or how to use numpy/cython/numba while pandas provides quite straightforward solutions.
excited about swifter groupby apply whenever it's here.
Also excited about this functionality. Nonvectorized groupby apply operations are frustratingly common unfortunately.
Thank you so much for contributing. Excited
Hi, was there a solve for the groupby+apply issue? This seems really useful. Thanks!
I'd love to see it too!
@jmcarpenter2 by the way, I succeeded in parallelizing groupby-apply manually with Ray only. It works even with mp.pool, but Ray is around 15% faster due to more efficient data transfer. The idea is to first split your dataframe into chunks based on one or all of the columns you want to perform .groupby on, and then feed the chunks to a number of Ray workers, each running a standard single-threaded pandas .groupby.apply. Use pd.concat afterwards.
Another trick is to set internal Ray's dataframe data state back to mutable to avoid unnecessary .copy() in and out of func. While .copy() is practically free in terms of computation time, it ruins everything with a 2x memory overhead, which you can't afford working with large data amounts.
Running it with 4 threads cuts the time by a factor of 2.5; with 12 threads, by a factor of 5. Here I'm calculating .ewm on a grouped series:
df is the dataframe we want to process; item_id and store_id are the columns to group by.
https://gist.github.com/diditforlulz273/06ffa5f5b1c00830671ce0330851352f
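For readers who do not want to open the gist, here is a minimal sketch of the chunking idea described above. It is not the gist's code and not swifter's implementation: the helper names (`split_by_groups`, `chunked_groupby_apply`) are hypothetical, and a standard-library thread pool stands in for Ray workers so the example stays self-contained. For CPU-bound functions you would likely swap in a process pool or Ray to see real speedups.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def split_by_groups(df, by, n_chunks):
    """Split df into at most n_chunks frames so that no group is split across chunks."""
    # ngroup() labels every row with the integer id of its group.
    group_ids = df.groupby(by, sort=False).ngroup()
    n_chunks = max(1, min(n_chunks, group_ids.nunique()))
    # Round-robin assignment of whole groups to chunks.
    chunk_ids = group_ids % n_chunks
    return [df[chunk_ids == i] for i in range(n_chunks)]


def _apply_chunk(chunk, by, column, func):
    # Plain single-threaded pandas groupby-apply on one chunk.
    return chunk.groupby(by)[column].apply(func)


def chunked_groupby_apply(df, by, column, func, n_workers=4):
    """Hypothetical helper: apply func per group, one chunk per worker, then concat."""
    chunks = split_by_groups(df, by, n_workers)
    # Assumption: a thread pool is used here only as a stand-in for Ray workers.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda c: _apply_chunk(c, by, column, func), chunks))
    return pd.concat(parts).sort_index()


df = pd.DataFrame(
    {
        "item_id": [1, 1, 2, 2, 3, 3],
        "store_id": ["a", "a", "a", "a", "b", "b"],
        "sales": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)


def last_ewm(s):
    # Last value of an exponentially weighted mean for one group.
    return s.ewm(alpha=0.5).mean().iloc[-1]


result = chunked_groupby_apply(df, ["item_id", "store_id"], "sales", last_ewm, n_workers=2)
```

Because whole groups are assigned to chunks, each worker sees complete groups and the concatenated result should match a plain `df.groupby(...)[column].apply(func)`.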
Hi @jmcarpenter2, is there any update on swifter groupby followed by apply? I also need groupby -> rolling -> apply. Thanks
There is already a library doing groupby -> apply parallelization (https://github.com/nalepae/pandarallel/). But I would definitely prefer using just one parallelization library for all pandas cases rather than a bunch of them. Any progress recently?
Thanks for your work already btw! :+1:
Totally agree with you @MikiGrit. Pandarallel is cool but it doesn't support Windows (outside WSL).
Any progress?
Hey everyone, thanks for the interest in a swifter groupby apply!!
I want to update the group that I have tried many different approaches (including the ray approach listed above, as well as literally every approach mentioned in this stack overflow post), and across multiple test cases I have not been able to find a single solution that provides actual performance gain over a simple pandas groupby.apply when a user provides an arbitrary function or lambda.
I am 1000000% aspiring to provide this functionality, but I would hate to put something out there that not only doesn't speed up performance, but actually slows down groupby applies for users (while giving the impression that it could speed them up)
Stay tuned as I am NOT GIVING UP! But I just wanted to update everyone because I know this has been a long-awaited feature that we unfortunately are still waiting on
THANK YOU ALL FOR YOUR PATIENCE!!!
Hi all!
TLDR: Groupby-Apply is now available in swifter[groupby]==1.3.2
After the previous post, I figured I had to go back and try @diditforlulz273's solution just one more time to see if I could get some of the purported performance benefit. As it turns out, I had missed a key piece of the solution during my own implementation which ultimately was messing with my results.
Once I fixed the bug in my code, I started seeing sizable performance improvements, as you can see in the following image:
I am so thankful to @diditforlulz273 for providing a clear gist that I could implement, and also to everyone here for continuing to encourage and express interest in this feature of swifter. I am so pleased to be able to provide this new capability for users of swifter. Please let me know if you run into issues using it. And once again, thank you all for your patience over the past THREE YEARS!!!
Installation:
$ pip install -U swifter[groupby]
or
$ pip install -U swifter[groupby]==1.3.2
Usage:
df.swifter.groupby("group").apply(func)
or
df.swifter.groupby(["group_1", "group_2"]).apply(func)
CC: @Vozf , @samuelefiorini, @Juanouo, @sushmit86, @LukasHaas, @abzhaobo, @ghazanfarj, @kmedved, @RonakAgrawal77, @tomerher, @babameme, @MikiGrit, @quancore, and anyone else who is quietly "watching" this thread :)
@jmcarpenter2 I wrote this code quite a time ago, and now I can say you should be afraid of making data mutable in the hard way I gave in my gist, namely:
# Ray makes data immutable when stored in its memory.
# This approach prevents state sharing among processes, but we have a separate chunk for each process
# to get rid of copying data, we make it mutable in-place again by this hack
for d in range(len(df._data.blocks)):
    try:
        df._data.blocks[d].values.flags.writeable = True
    except Exception:
        pass
It might crash in very rare cases. But might not :) Just watch out, if people post issues about it - turn it off, as far as I remember this wouldn't affect speed, only memory footprint.
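To make the trade-off concrete, here is a small NumPy-only sketch of what the flag flip does, using a locally created read-only array as a stand-in for data handed back by Ray (an assumption for illustration; no Ray objects are involved). It also shows the copy-based alternative, which costs extra memory but does not rely on mutating a read-only buffer.

```python
import numpy as np

# Simulate an array that was handed back as read-only.
values = np.arange(5, dtype="float64")
values.setflags(write=False)

# Writing to a read-only array raises ValueError.
try:
    values[0] = 10.0
    write_failed = False
except ValueError:
    write_failed = True

# Safer alternative: take a copy, which is always writable (extra memory).
safe = values.copy()
safe[0] = 10.0

# The hack from the snippet above: flip the flag in place (no extra memory).
# NumPy allows this here because the array owns its own memory; for arrays
# backed by a read-only external buffer the assignment can raise instead,
# which is presumably why the original snippet wraps it in try/except.
values.flags.writeable = True
values[0] = 10.0
```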
Thanks for the heads up, I'll be sure to remove it if people raise any issues related to crashing :) unless you think I should just straight up remove it regardless?
@jmcarpenter2 I am facing an issue where I can't filter out certain columns when running a groupby and apply:
(df
    .groupby('id')[list_of_columns]
    .apply(func)
)
I get the following error: TypeError: 'GroupBy' object is not subscriptable. Is this a bug or to be expected?
Ooooh good call @yudhiesh, this is a common use-case for groupby applies. I will take a look into this. I believe enabling this type of functionality is possible.
Hey @yudhiesh, following up here that this df.swifter.groupby(by)[key].apply(func) functionality will be added in v1.3.4, to be released later today once the CI/CD completes for PR #199
@jmcarpenter2 thank you for the swift response and PR!
Thanks for this. When I apply a rolling function afterwards, I get the following: AttributeError: 'Rolling' object has no attribute 'agg'
df.swifter.groupby(df[ticker])[columns].rolling(i).agg(functions)
hey @jmcarpenter2
df.swifter.groupby(by)[key].apply(func) doesn't work; it fails with this error:
File "/home/++++/.local/lib/python3.6/site-packages/swifter/swifter.py", line 623, in _ray_groupby_apply_chunk
    grpby = chunk.groupby(by, axis=self._axis, **self._grpby_kwargs)
TypeError: groupby() got an unexpected keyword argument 'dropna'
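A note for anyone hitting the same error: the message suggests the installed pandas does not accept `dropna` in `groupby()`. That keyword was added in pandas 1.1, so upgrading pandas is the likely fix. As an illustration only (this is not swifter's code, and `supported_groupby_kwargs` is a hypothetical helper), forwarded keyword arguments can be filtered against what the installed pandas actually accepts:

```python
import inspect

import pandas as pd


def supported_groupby_kwargs(kwargs):
    """Hypothetical helper: keep only kwargs that DataFrame.groupby accepts."""
    accepted = inspect.signature(pd.DataFrame.groupby).parameters
    return {key: value for key, value in kwargs.items() if key in accepted}


# "sort" is a long-standing groupby option; the second key is deliberately bogus.
filtered = supported_groupby_kwargs({"sort": False, "not_a_groupby_option": 1})

df = pd.DataFrame({"id": [1, 1, 2], "x": [1, 2, 3]})
sums = df.groupby("id", **filtered)["x"].sum()
```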
Hey @jmcarpenter2, I get a TypeError when I run this code:
df.swifter.groupby(pd.Grouper(key='start_date', freq='1H')).apply(foo)
File ~/++++/.venv/lib/python3.10/site-packages/swifter/swifter.py:592, in GroupBy._get_chunks(self)
    591 def _get_chunks(self):
--> 592     subset_df = self._obj_pd.index if self._grpby_index else self._obj_pd[self._by[0]]
    593     unique_groups = subset_df.unique()
    594     n_splits = min(len(unique_groups), self._npartitions)
TypeError: 'TimeGrouper' object is not subscriptable
Hey @Zimmerling, thanks for flagging this. I need to expand support for pd.Grouper objects. Currently, the groupby only supports the by argument for columns or the index. But the use case is very clear from your example, thanks for showing that. I will work on this soon.
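Until pd.Grouper is supported, one possible workaround is to materialize the time bucket as an ordinary column and group by that column name, since column names are the supported form of `by`. The sketch below uses plain pandas so it runs without swifter; the column name `start_hour` and the sample data are made up for illustration, and passing `"start_hour"` to `df.swifter.groupby` is an untested assumption.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "start_date": pd.to_datetime(
            [
                "2023-01-01 00:05",
                "2023-01-01 00:40",
                "2023-01-01 01:10",
                "2023-01-01 03:30",
            ]
        ),
        "value": [1, 2, 3, 4],
    }
)

# Floor each timestamp to the start of its hour and store it as a regular column.
df["start_hour"] = df["start_date"].dt.floor("h")

# Group by the derived column instead of pd.Grouper(key="start_date", freq="1H").
hourly = df.groupby("start_hour")["value"].apply(lambda s: s.sum())
```

One difference to keep in mind: the floored column only yields hours that actually occur in the data, whereas pd.Grouper with a frequency can also produce empty bins (here, the 02:00 hour).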
It would be nice to be able to apply on groupby objects; currently an error is raised.