mlahir1 closed this issue 5 years ago.
I have increased the number of rows to 10000.
df['key'] = [i for i in range(10000)]
df['val'] = [i for i in range(10000)]
df['val1'] = [i for i in range(10000)]
CPU times: user 5min 22s, sys: 40.6 s, total: 6min 3s Wall time: 6min 51s
CPU times: user 88 ms, sys: 4 ms, total: 92 ms Wall time: 423 ms
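For reference, a minimal sketch of the presumed reproduction (the exact script isn't shown above; the dask_cudf path, npartitions=2, and the groupby call are assumptions based on the issue title and the pandas example further down):

import cudf
import dask_cudf

# Build the 10000-row frame described above.
df = cudf.DataFrame()
df['key'] = [i for i in range(10000)]
df['val'] = [i for i in range(10000)]
df['val1'] = [i for i in range(10000)]
df['mult'] = df['key'] * df['val']

ddf = dask_cudf.from_cudf(df, npartitions=2)

# Single-GPU cudf groupby (the fast path).
%time df.groupby(['key', 'val'])['mult'].mean()

# dask_cudf groupby (the slow path in this report).
%time ddf.groupby(['key', 'val']).mult.mean().compute()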
cc @randerzander
@mlahir1 could you please also include your cudf / dask-cudf versions and method of installation?
I am building from branch-0.8 source (dask-cudf). The source code is 2 days old, built Tuesday morning at 4AM CST.
These notes in the Dask Best Practices docs may be relevant:
http://docs.dask.org/en/latest/dataframe-best-practices.html#use-pandas
http://docs.dask.org/en/latest/best-practices.html#start-small
Currently, on our branch-0.8, I believe @mlahir1's code will error. The PR rapidsai/cudf#1882 should resolve the error when it merges.
However, when I run the example code with cuDF built on that PR, I also see it taking an incredibly long time (eventually finishing).
The equivalent pandas/dask.dataframe code takes no time at all:
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame()
df['key'] = [i for i in range(10000)]
df['val'] = [i for i in range(10000)]
df['val1'] = [i for i in range(10000)]
df['mult'] = df['key'] * df['val']
ddf = dd.from_pandas(df, npartitions=2)
----
%time ddf.groupby(['key', 'val']).mult.mean().compute()
CPU times: user 60 ms, sys: 8 ms, total: 68 ms
Wall time: 47.7 ms
@beckernick is it the from_cudf call taking a long time or the actual ops underneath running in Dask?
@kkraus14 the from_cudf call is quick. Profiling it a bit more now.
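For reference, one way to get such a profile (a sketch of a general approach, not necessarily what was used here); running the compute on the synchronous scheduler keeps all work on the calling thread so cProfile can see the cudf internals:

import cProfile
import pstats

# Assumes the dask_cudf frame `ddf` from the reproduction above.
profiler = cProfile.Profile()
profiler.enable()
ddf.groupby(['key', 'val']).mult.mean().compute(scheduler='synchronous')
profiler.disable()

# Sort by cumulative time to see which helpers dominate.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)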
@kkraus14 @thomcom, when done in dask-cudf, the masked_assign in the initialization of the groupby takes up an inordinate amount of the time. Within that, cudautils.compact_mask_bytes takes up the most time (and within that, it's actually making the mask that takes the most time).
I have a quick and easy fix using replace instead of masked_assign that improves performance by 50x, but it is still too slow. I'm working on another solution.
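For context, a conceptual numpy sketch of the mask-building step being discussed, assuming compact_mask_bytes packs a byte-per-element validity mask into a bitmask (an illustration only, not cudf's actual kernel):

import numpy as np

# One byte per row: 1 = valid, 0 = null.
mask_bytes = np.array([1, 1, 0, 1, 0, 0, 1, 1], dtype=np.uint8)

# A single vectorized pack; building the equivalent mask element-by-element
# for millions of rows is the kind of step that dominates a profile.
bitmask = np.packbits(mask_bytes, bitorder='little')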
I wrote a better solution that doesn't use replace or masked_assign. Runtime for 7 rows is 600 ms, 10k rows is 800 ms, and 50M rows is 5 s. PR coming up.
The groupby throws the following error after rapidsai/cudf#1896:
AttributeError: '_DataFrameIlocIndexer' object has no attribute '_getitem_multiindex_arg'
@mlahir1 can you confirm you have the latest from branch-0.8 ?
I am sure I have the latest cudf from branch-0.8. I can try clean-building it and testing it again.
Strange, that issue was resolved by https://github.com/rapidsai/cudf/pull/1882
@thomcom I clean-built it and now it works fine. Strange. Thanks.
Multi-column groupby on dask_cudf is over 10x slower compared to cudf
Dask CUDF
Performance Results:
CPU times: user 0 ns, sys: 0 ns, total: 0 ns Wall time: 542 µs
CPU times: user 704 ms, sys: 20 ms, total: 724 ms Wall time: 725 ms
CPU times: user 392 ms, sys: 4 ms, total: 396 ms Wall time: 394 ms
CUDF
Performance Results:
CPU times: user 4 ms, sys: 0 ns, total: 4 ms Wall time: 1.19 ms
CPU times: user 76 ms, sys: 0 ns, total: 76 ms Wall time: 75.1 ms
The gap grows even larger with larger DataFrames.