rapidsai / dask-cudf

[ARCHIVED] Dask support for distributed GDF object --> Moved to cudf
https://github.com/rapidsai/cudf
Apache License 2.0

[BUG] Multi Column Groupby is Very Slow #265

Closed: mlahir1 closed this issue 5 years ago

mlahir1 commented 5 years ago

Multi-column groupby on dask_cudf is over 10x slower than on cudf.

Dask CUDF

from cudf import DataFrame
import dask_cudf

# Build a small cudf DataFrame with two grouping keys and a value column.
df = DataFrame()
df['key'] = [0, 0, 1, 1, 2, 2, 2]
df['val'] = [0, 1, 2, 3, 4, 5, 6]
df['val1'] = [0, 1, 2, 3, 4, 5, 6]
df['mult'] = df['key'] * df['val']

# Split into two partitions and run the same groupby through dask_cudf.
ddf = dask_cudf.from_cudf(df, npartitions=2)

%time groups = ddf.groupby(['key', 'val'])
%time res = groups.mult.mean()
%time val = res.compute()

print(val.to_pandas())

Performance Results:

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 542 µs
CPU times: user 704 ms, sys: 20 ms, total: 724 ms
Wall time: 725 ms
CPU times: user 392 ms, sys: 4 ms, total: 396 ms
Wall time: 394 ms

CUDF

from cudf import DataFrame

# Same data as above, but grouped directly with cudf (no Dask).
df = DataFrame()
df['key'] = [0, 0, 1, 1, 2, 2, 2]
df['val'] = [0, 1, 2, 3, 4, 5, 6]
df['val1'] = [0, 1, 2, 3, 4, 5, 6]
df['mult'] = df['key'] * df['val']

%time groups = df.groupby(['key', 'val'], method='cudf')
%time res = groups.mean()
print(res)

Performance Results:

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 1.19 ms
CPU times: user 76 ms, sys: 0 ns, total: 76 ms
Wall time: 75.1 ms

The slowdown grows far worse on larger DataFrames.
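
A minimal sketch (my own harness, not from the original report) for timing both paths at larger sizes, assuming a single local GPU and the same cudf/dask_cudf API as above; the row counts are illustrative:

import time
from cudf import DataFrame
import dask_cudf

for n in (10_000, 100_000, 1_000_000):
    # Same shape of data as the example above, scaled up.
    df = DataFrame()
    df['key'] = list(range(n))
    df['val'] = list(range(n))
    df['mult'] = df['key'] * df['val']

    # Plain cudf groupby.
    t0 = time.time()
    df.groupby(['key', 'val'], method='cudf').mean()
    cudf_sec = time.time() - t0

    # The same groupby through dask_cudf with two partitions.
    ddf = dask_cudf.from_cudf(df, npartitions=2)
    t0 = time.time()
    ddf.groupby(['key', 'val']).mult.mean().compute()
    dask_sec = time.time() - t0

    print(f"{n} rows: cudf {cudf_sec:.3f}s, dask_cudf {dask_sec:.3f}s")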

mlahir1 commented 5 years ago

I have increased the number of rows to 10000:

df['key'] = [i for i in range(10000)]
df['val'] = [i for i in range(10000)]
df['val1'] = [i for i in range(10000)]

dask_cudf

CPU times: user 5min 22s, sys: 40.6 s, total: 6min 3s
Wall time: 6min 51s

cudf

CPU times: user 88 ms, sys: 4 ms, total: 92 ms
Wall time: 423 ms

beckernick commented 5 years ago

cc @randerzander

@mlahir1 could you please also include your cudf / dask-cudf versions and method of installation?

mlahir1 commented 5 years ago

I am building dask-cudf from the branch-0.8 source. The source code is two days old, built Tuesday morning at 4 AM CST.

mrocklin commented 5 years ago

These notes in the Dask Best Practices docs may be relevant:

http://docs.dask.org/en/latest/dataframe-best-practices.html#use-pandas
http://docs.dask.org/en/latest/best-practices.html#start-small


beckernick commented 5 years ago

Currently, on our branch-0.8, I believe @mlahir1's code will error. The PR rapidsai/cudf#1882 should resolve the error when it merges.

However, when I run the example code with cuDF built on that PR, I also see it taking an incredibly long time (eventually finishing).

The equivalent pandas/dask.dataframe code takes no time at all:

import pandas as pd
import dask.dataframe as dd

# Same 10,000-row setup, this time in pandas/dask.dataframe.
df = pd.DataFrame()
df['key'] = [i for i in range(10000)]
df['val'] = [i for i in range(10000)]
df['val1'] = [i for i in range(10000)]
df['mult'] = df['key'] * df['val']

ddf = dd.from_pandas(df, npartitions=2)

%time ddf.groupby(['key', 'val']).mult.mean().compute()
CPU times: user 60 ms, sys: 8 ms, total: 68 ms
Wall time: 47.7 ms

kkraus14 commented 5 years ago

@beckernick is it the from_cudf call taking a long time or the actual ops underneath running in Dask?

beckernick commented 5 years ago

@kkraus14 the from_cudf call is quick. Profiling it a bit more now.
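
One way to profile this (my illustration, not the actual commands used; it assumes the ddf from the example above and forces Dask's synchronous scheduler so the work runs in the profiled thread):

import cProfile
import pstats

# Run the slow groupby under cProfile on the single-threaded scheduler.
cProfile.run(
    "ddf.groupby(['key', 'val']).mult.mean().compute(scheduler='synchronous')",
    "groupby.prof",
)

# Show the 15 most expensive calls by cumulative time.
pstats.Stats("groupby.prof").sort_stats("cumulative").print_stats(15)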

beckernick commented 5 years ago

@kkraus14 @thomcom, when done in dask-cudf, the masked_assign in the initialization of the groupby takes up an inordinate amount of the time. Within that, cudautils.compact_mask_bytes takes up the most time (and within that, it's actually making the mask that takes the most time).
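
For context, cudf stores null validity as a packed bitmask, and compact_mask_bytes packs a byte-per-element boolean mask into bits. A rough NumPy illustration of that operation (not cudf's implementation):

import numpy as np

# One byte per element: 1 = valid, 0 = null.
mask_bytes = np.array([1, 1, 0, 1, 0, 0, 1], dtype=np.uint8)

# Pack eight byte-flags into each output byte, least significant bit
# first, like an Arrow-style validity bitmask.
bitmask = np.packbits(mask_bytes, bitorder="little")

print(bitmask)  # [75] -> bits 0, 1, 3, and 6 are set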

thomcom commented 5 years ago

I have a quick and easy fix using replace instead of masked_assign that improves performance by 50x, but it is still too slow. I'm working on another solution.
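
For illustration, the general difference between the two approaches on a Series (assumed semantics on a recent cudf, not the actual patch):

import cudf

s = cudf.Series([0, 1, 2, 3, 4])

# Masked assignment: build a boolean mask, then scatter into a copy.
out = s.copy()
out[s == 2] = -1

# replace maps old values to new ones in a single pass.
out2 = s.replace(2, -1)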

thomcom commented 5 years ago

I wrote a better solution that doesn't use replace or masked_assign. Runtime for 7 rows is 600 ms, for 10k rows 800 ms, and for 50M rows 5 s. PR coming up.

mlahir1 commented 5 years ago

The groupby throws the following error after rapidsai/cudf#1896:

AttributeError: '_DataFrameIlocIndexer' object has no attribute '_getitem_multiindex_arg'

quasiben commented 5 years ago

@mlahir1 can you confirm you have the latest from branch-0.8?

mlahir1 commented 5 years ago

I am sure I have the latest cudf from branch-0.8. I can try a clean build and test it again.


thomcom commented 5 years ago

Strange, that issue was resolved by https://github.com/rapidsai/cudf/pull/1882

mlahir1 commented 5 years ago

@thomcom I did a clean build and now it works fine. Strange. Thanks.