rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[QST] TSNE running for too long on a "quite small" dataframe #4488

Open Jacopobracaloni opened 2 years ago

Jacopobracaloni commented 2 years ago

Hello everyone, I am running the cuML TSNE algorithm on the following dataset: https://github.com/oreilly-mlsec/book-resources/tree/master/chapter2/datasets .

The dataset was loaded as a cuDF dataframe (df = cudf.read_csv("payment_fraud.csv")). The resulting dataframe has approximately 32 thousand rows and 6 columns.

I then one-hot encoded the nominal feature by running cudf.get_dummies(df, columns = 'paymentMethod'), which gave a dataframe with 9 columns in total. To run unsupervised learning on the data, I removed the binary (0/1) 'label' column. After that, I ran TSNE .fit_transform() on the resulting dataframe, leaving the method arguments at their default settings (I want TSNE to output just 2 components).
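For readers who want to reproduce the setup, the preprocessing steps described above can be sketched roughly as follows. This is only a sketch: the helper name load_features and the three constants are hypothetical, the float32 cast is an assumption that is not part of the original steps, and cudf is imported inside the function because it needs a RAPIDS install and an NVIDIA GPU.

```python
CSV_PATH = "payment_fraud.csv"           # file name given in the post
CATEGORICAL_COLUMNS = ["paymentMethod"]  # nominal feature to one-hot encode
LABEL_COLUMN = "label"                   # binary 0/1 column, dropped for unsupervised use


def load_features(csv_path=CSV_PATH):
    """Read the CSV into a cuDF dataframe, one-hot encode, and drop the label."""
    import cudf  # lazy import: requires a RAPIDS install and an NVIDIA GPU

    df = cudf.read_csv(csv_path)
    df = cudf.get_dummies(df, columns=CATEGORICAL_COLUMNS)
    features = df.drop(columns=[LABEL_COLUMN])
    # Assumption (not in the original post): cast to float32 so that every
    # column handed to the estimator has the same dtype.
    return features.astype("float32")
```

Calling load_features() returns the dataframe that is then passed to TSNE .fit_transform(); no standardization is applied, matching the description here.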

I did not standardize the features as I want to maintain their visualization properties.

Although the dataframe is quite small, TSNE had been running for more than 5 hours when I decided to reset the Jupyter Notebook kernel.

I am using the RAPIDS AI stable release on an Ubuntu 18.04 WSL 2 subsystem. My main OS is Windows 11. I installed the software via Conda (Python 3.8, CUDA 11.5). I have an RTX 3060 Laptop GPU. All the NVIDIA drivers are updated to their latest versions.

Jacopobracaloni commented 2 years ago

In case anyone has trouble downloading the dataset file from the source link, I am attaching it to this comment: payment_fraud.csv .

cjnolet commented 2 years ago

Hi @Jacopobracaloni, it certainly should not be taking hours on a dataset with a few tens of thousands of data points, especially with such a small number of columns.

We have been experiencing this behavior from time to time with the Barnes-Hut method on GPU and we are working to fix it. In the meantime, we do have an alternative method you can use by setting method='fft' when constructing the TSNE estimator. Can you try this and let us know if it works for you?
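A minimal sketch of that workaround, assuming the features are already in a cuDF dataframe. The helper name embed_2d and the SUPPORTED_METHODS tuple are hypothetical; the only part taken from this thread is passing method='fft' to the TSNE constructor. The method list reflects the values documented for cuML's TSNE and may vary between releases.

```python
# Method names accepted by cuML's TSNE according to its documentation
# (assumption: the exact set may differ between cuML releases).
SUPPORTED_METHODS = ("barnes_hut", "fft", "exact")


def embed_2d(features, method="fft"):
    """Project `features` to 2 components with cuML TSNE using `method`."""
    if method not in SUPPORTED_METHODS:
        raise ValueError(f"unknown TSNE method: {method!r}")

    # Lazy import: cuML requires a RAPIDS install and an NVIDIA GPU.
    from cuml.manifold import TSNE

    tsne = TSNE(n_components=2, method=method)
    return tsne.fit_transform(features)
```

With a prepared dataframe named features, embedding = embed_2d(features) runs the FFT-based variant, and embed_2d(features, method="barnes_hut") selects the Barnes-Hut path for comparison.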

Jacopobracaloni commented 2 years ago

The Fast Fourier Transform method takes exactly 5.43 seconds to run.

lowener commented 2 years ago

I was able to reproduce this issue. FFT works but Barnes-Hut does not. I'll see if the updates to the Barnes-Hut kernels can solve this issue.
