ericphan-nv opened 4 days ago
It seems like the slowdowns are due to `from_pandas` spending lots of time in `_has_any_nan`. It seems like maybe this is happening in the cells that call `replace`?
```python
# Apply the consolidation
df['Company'] = df['Company'][df['Company'].isin(name_mapping.keys())].replace(name_mapping).astype('category')
```
This takes ~75 seconds in `DataFrame.__getitem__`. I think this is related to the `_has_any_nan` call?
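For reference, the slow cell can be reduced to a small standalone example (plain pandas here; the sample data and `name_mapping` contents are hypothetical placeholders for the notebook's dataset). A `Series.map`-based rewrite produces the same result without the intermediate boolean-mask `__getitem__` and `replace`, which might sidestep the hot path, though that is untested against `cudf.pandas`:

```python
import pandas as pd

# Hypothetical stand-in for the notebook's dataset and mapping.
df = pd.DataFrame({'Company': ['Acme Inc', 'Acme', 'Globex', 'Initech']})
name_mapping = {'Acme Inc': 'Acme', 'Globex Corp': 'Globex'}

# Original pattern: boolean-mask __getitem__, then replace. Rows outside the
# mapping become NaN via index alignment on reassignment; .reindex() makes
# that alignment explicit so the result can be inspected standalone.
slow = (
    df['Company'][df['Company'].isin(name_mapping.keys())]
    .replace(name_mapping)
    .reindex(df.index)
    .astype('category')
)

# Single-pass alternative: Series.map yields NaN for keys absent from the
# dict, reproducing the same output without the intermediate filtered Series.
fast = df['Company'].map(name_mapping).astype('category')
```

Both versions yield `['Acme', NaN, NaN, NaN]` on the placeholder data; whether the `map` form avoids the `_has_any_nan` overhead under `cudf.pandas` would need to be profiled.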
I am not able to dig any further on this at the moment, but perhaps @galipremsagar or @mroeschke would have insight.
I found the bug, working on a fix.
**Describe the bug**
Low performance with cuDF Pandas and XGBoost using this dataset and notebook. Performance is slower than the CPU equivalents. Tested on Colab with an L4 GPU and locally on WSL with a 4090.
**Steps/Code to reproduce bug**
Open the notebook and run through the cells. Observe the slow performance compared to CPU pandas and XGBoost.
**Expected behavior**
Performance with cuDF Pandas and XGBoost is expected to be significantly faster than on CPU.
**Environment overview (please complete the following information)**
**Additional context**
| Environment | Loading time | Preprocessing time | Training time |
| --- | --- | --- | --- |
| Colab L4 | 46 s | 476 s | 240 s |
| Colab CPU | 23 s | 47 s | 252 s |