rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[QST] pandas pivot table fillna performance substantially slower when running on cuDF GPU? #14396

Closed ALIANS-PRODUCTIONS closed 10 months ago

ALIANS-PRODUCTIONS commented 10 months ago

Windows 11 PC, WSL2 (Ubuntu), conda virtual env, Python 3.10, pandas 1.5.3, cuDF 23.10, CUDA Toolkit 11.8, i9-12900K, RTX 3090 GPU, 128 GB DDR5 memory at stock settings.

What is your question?

[I am not sure whether I have set up RAPIDS/cuDF correctly, hence posting here as a sanity check; I would appreciate any guidance.] Pivot tables on a pandas DataFrame show a substantial performance DROP when using cuDF/GPU versus CPU. The attached profile for a block of notebook code highlights a significant slowdown when using fillna on the pivot table.

Is this expected/known behaviour?

For reference, the block of code took 0.5 s on CPU but 8.6 s on cuDF.

The performance delta was measured by keeping everything else the same and just adding/removing %load_ext cudf.pandas at the start of the notebook. The notebook was run from Windows in Visual Studio Code.

Attached is the output of %%cudf.pandas.profile (adding the profiler increases the run time to 19 s).

[screenshot: %%cudf.pandas.profile output]

bdice commented 10 months ago

@ALIANS-PRODUCTIONS Can you provide a minimal reproducer code demonstrating this slowdown? It’s hard to reason about performance without seeing the code, and the profiler snippet you shared doesn’t quite show all the detail I want to see. The data types in your dataframe and the number of columns and rows would be helpful to know, too.

ALIANS-PRODUCTIONS commented 10 months ago

Thank you for looking into this! I have 'anonymised' the data and taken a sample of the code, which exhibits the same behaviour: 0.1 s on CPU versus 4 s on cuDF. Attached is a screenshot of the time impact if we comment out the fillna line: the completion time comes down to 1.7 s.

Please find below the prototype code, followed by screenshots of what the dataframe looks like:

pivot_annon_df = annon_df.pivot_table(index='column_1', columns='column_11', values='column_4').copy()
pivot_annon_df = pivot_annon_df.reset_index()
pivot_annon_df = pivot_annon_df.fillna(0)
pivot_annon_df = pivot_annon_df.set_index('column_1')
pivot_annon_df.index.name = None
pivot_annon_df = pivot_annon_df.transpose()
pivot_annon_df.index.name = None
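For anyone trying to reproduce without the attached dump, a minimal sketch with synthetic data that exercises the same pipeline (column names mirror the thread, but the sizes and values here are assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for annon_df -- shape and contents are assumptions.
rng = np.random.default_rng(0)
n_rows = 5_000
annon_df = pd.DataFrame({
    'column_1': rng.choice([f'id_{i}' for i in range(500)], size=n_rows),
    'column_11': rng.choice([f'ts_{i}' for i in range(300)], size=n_rows),
    'column_4': rng.random(n_rows),
})

# Same pipeline as in the question: pivot, fill NaNs with 0, transpose.
pivot_annon_df = annon_df.pivot_table(index='column_1', columns='column_11',
                                      values='column_4').copy()
pivot_annon_df = pivot_annon_df.reset_index()
pivot_annon_df = pivot_annon_df.fillna(0)
pivot_annon_df = pivot_annon_df.set_index('column_1')
pivot_annon_df.index.name = None
pivot_annon_df = pivot_annon_df.transpose()
pivot_annon_df.index.name = None

print(pivot_annon_df.shape)  # short-and-wide, and mostly zeros after fillna
```

Running this under %load_ext cudf.pandas versus plain pandas should reproduce the delta described above.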

[screenshot: the anonymised dataframe]

Impact of removing the fillna line: the time taken collapses to 1.7 s (still very slow compared to CPU).

[screenshot: profile with the fillna line removed]

shwina commented 10 months ago

Thanks, @ALIANS-PRODUCTIONS -- just to get an idea of the dimensions here: what's the shape of your DataFrame after the call to pivot_table?

ALIANS-PRODUCTIONS commented 10 months ago

For this test object: 3335 rows by 1862 columns, but for the actual code the number of COLUMNS can be over 500k.

shwina commented 10 months ago

Thanks!

I wonder if there's a way to recast your operations so that they are performed on tall-and-skinny dataframes rather than short-and-wide dataframes. If you can provide a complete program with perhaps a representative dataset, we're definitely happy to see if it's possible to write it in a GPU-friendly way!

shwina commented 10 months ago

I should say that Pandas will probably benefit a lot from operating on tall-and-skinny data as well (so it is worth doing generally!)
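To illustrate the point on synthetic data (column names mirror the thread; sizes and contents are assumptions), the groupby route computes the same means without ever materializing the mostly-zero wide frame, and is typically faster even in plain pandas:

```python
import time
import numpy as np
import pandas as pd

# Synthetic data -- not the attached dataset.
rng = np.random.default_rng(42)
n = 200_000
df = pd.DataFrame({
    'column_1': rng.integers(0, 2_000, n),
    'column_11': rng.integers(0, 1_000, n),
    'column_4': rng.random(n),
})

# Wide route: materialize a ~2000 x 1000 frame, then fill the holes with zeros.
t0 = time.perf_counter()
wide = df.pivot_table(index='column_1', columns='column_11',
                      values='column_4').fillna(0)
t_pivot = time.perf_counter() - t0

# Tall route: one row per observed (column_11, column_1) pair, no zeros stored.
t0 = time.perf_counter()
tall = df.groupby(['column_11', 'column_1'])['column_4'].mean()
t_group = time.perf_counter() - t0

print(f"pivot_table+fillna: {t_pivot:.3f}s  groupby: {t_group:.3f}s")
```

The exact timings will vary by machine, but the tall route also touches far fewer cells, which is what makes it GPU-friendly.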

ALIANS-PRODUCTIONS commented 10 months ago

Hmmm... I am not sure that will be possible: the actual data can run well over 500k to 2 million rows, and approximately 3000 columns wide.

Not sure how that compares to the tall-versus-skinny benchmark.

When I run on the full dataset I do still see the same glaring time difference.

The code above is pretty much the full code that is causing the issue. Attached is an HDF5 dump of the annon_df dataframe. It would be very interesting to see what performance figures you get on your side if you have time. Thank you!

annon_df.zip

shwina commented 10 months ago

Thanks for the additional info!

The code above is pretty much the full code that is causing the issue.

So it looks like the code you provided "reshapes" your data from shape (17990, 12) to (3335, 1862) - and notably, the latter DataFrame is mostly composed of zeros.

Just curious: what kind of operations are you doing after this reshaping?

I'm asking because 99% of the time, it's possible to do those without having reshaped the data at all.

Just to demonstrate: a groupby() operation yields much the same information as the pivot_annon_df object you have, but it's MUCH faster and more memory-efficient (it doesn't store any of the zeros).

>>> pivot_annon_df.loc["000f53a5e4a07c0f63a2d96007093926574617993de54dd962ac468cdefc5458", "2023-11-10 16:53:59.999"]
0.46219328742562715
>>> grouped = annon_df.groupby(['column_11', 'column_1']).column_4.mean() # much faster to compute than pivot_annon_df
>>> grouped["000f53a5e4a07c0f63a2d96007093926574617993de54dd962ac468cdefc5458", "2023-11-10 16:53:59.999"]
0.46219328742562715

Let me know if this sounds like an interesting route to pursue for you. If you can avoid reshaping your data, you definitely should - you'll see nice speedups on both CPU and GPU (but especially the latter!)

ALIANS-PRODUCTIONS commented 10 months ago

Thank you very much for that suggestion; I will look into refactoring the code. I am not applying any functions that could not be handled in the manner you have described above.

Just for my own sanity check (to make sure my cuDF/WSL install is working correctly): may I confirm that the code as-is does run slower on cuDF versus CPU for you too?

shwina commented 10 months ago

Ah - sorry for not answering that right off the bat. Indeed, I see slow GPU performance with the snippet you provided: 5s on GPU versus 1.5s on CPU.

With the groupby-based approach, here are my timings:

# regular pandas:
In [3]: %%time
   ...: annon_df.groupby(['column_11', 'column_1']).column_4.mean()
   ...: 
   ...: 
CPU times: user 13.8 ms, sys: 0 ns, total: 13.8 ms
Wall time: 13.4 ms

# cudf.pandas enabled:
In [5]: %%time
   ...: annon_df.groupby(['column_11', 'column_1']).column_4.mean()
   ...: 
   ...: 
CPU times: user 18.3 ms, sys: 358 µs, total: 18.6 ms
Wall time: 20.8 ms

Overall, much better timings than before. For this modest data size, there's not a speedup from using the GPU - but you should see it for your larger datasets.

My own specs:

shwina commented 10 months ago

@ALIANS-PRODUCTIONS I'm going to go ahead and close this issue out, but -- if you have any updates or follow-up questions, please feel free to reopen at any time!