rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Slow Performance of cuDF Pandas on L4 #17140

Open ericphan-nv opened 4 days ago

ericphan-nv commented 4 days ago

Describe the bug

cuDF Pandas with XGBoost performs poorly on this dataset and notebook: it is slower than the CPU equivalents. Tested on Colab with an L4 GPU and on local WSL with an RTX 4090.

Steps/Code to reproduce bug

Open the notebook and run through the cells. Observe the slow performance compared to CPU pandas and XGBoost.

Expected behavior

cuDF Pandas and XGBoost are expected to be significantly faster than their CPU counterparts.


Additional context

Colab L4:
Loading time - 46 seconds
Preprocessing time - 476 seconds
Training time - 240 seconds

Colab CPU:
Loading time - 23 seconds
Preprocessing time - 47 seconds
Training time - 252 seconds
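
To make the gap concrete, the per-stage ratios can be worked out from the timings reported above. This is only arithmetic on those six numbers; the dictionary and function names are illustrative and not part of the notebook.

```python
# Timings in seconds, copied from the Colab L4 and Colab CPU runs reported above.
gpu_l4 = {"loading": 46, "preprocessing": 476, "training": 240}
cpu = {"loading": 23, "preprocessing": 47, "training": 252}


def time_ratio(gpu_seconds, cpu_seconds):
    """Return GPU time divided by CPU time (values above 1 mean the GPU run was slower)."""
    return gpu_seconds / cpu_seconds


# Ratio for each stage, plus the end-to-end totals.
ratios = {stage: time_ratio(gpu_l4[stage], cpu[stage]) for stage in gpu_l4}
total_gpu = sum(gpu_l4.values())
total_cpu = sum(cpu.values())

for stage, ratio in ratios.items():
    print(f"{stage}: GPU run took {ratio:.2f}x the CPU time")
print(f"total: {total_gpu} s on L4 vs {total_cpu} s on CPU")
```

Preprocessing dominates the regression (roughly ten times the CPU time), while training is about on par, which suggests the DataFrame preprocessing path is where the slowdown lies.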

bdice commented 3 days ago

It seems like the slowdowns are due to from_pandas spending lots of time in _has_any_nan.

https://github.com/rapidsai/cudf/blob/4fe338c0efe0fee2ee69c8207f9f4cbe9aa4d4a2/python/cudf/cudf/core/column/column.py#L1478-L1483

It seems like maybe this is happening in the cells that call replace?

# Apply the consolidation
df['Company'] = df['Company'][df['Company'].isin(name_mapping.keys())].replace(name_mapping).astype('category')

That line takes ~75 seconds in DataFrame.__getitem__. I think this is related to the _has_any_nan call?

I am not able to dig any further on this at the moment but perhaps @galipremsagar or @mroeschke would have insight.
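
For anyone who wants to repeat this kind of investigation, a function-level profile is one way to see which call the time accumulates in. The sketch below uses only Python's standard-library cProfile; `has_any_nan_standin` and `convert_standin` are hypothetical stand-ins written for this example and are not cuDF code, so it shows the technique rather than reproducing the actual slowdown.

```python
import cProfile
import io
import pstats


def has_any_nan_standin(values):
    # Hypothetical stand-in for an expensive per-column NaN scan; not cuDF code.
    # NaN is the only float value that is not equal to itself.
    return any(v != v for v in values)


def convert_standin(columns):
    # Hypothetical stand-in for a conversion step that rescans every column.
    return [has_any_nan_standin(col) for col in columns]


def profile_call(func, *args):
    """Run func under cProfile; return its result and a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


# Synthetic data: 20 columns of 50,000 floats, none of them NaN.
columns = [[float(i) for i in range(50_000)] for _ in range(20)]
result, report = profile_call(convert_standin, columns)
print(report)
```

In the printed report, the rows with the largest cumulative time point at the functions worth inspecting first, which is the same reasoning that leads from from_pandas to _has_any_nan above.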

galipremsagar commented 3 days ago

I found the bug, working on a fix.