Open Jacopobracaloni opened 2 years ago
If anyone is having trouble downloading the dataset file from the source link, I am attaching it here in the comments: payment_fraud.csv.
Hi @Jacopobracaloni, it certainly should not take hours on a dataset with a few tens of thousands of data points, especially with such a small number of columns.
We have been seeing this behavior from time to time with the Barnes-Hut method on GPU and are working on a fix. In the meantime, there is an alternative method you can use by setting method='fft'
when constructing the TSNE estimator. Can you try this and let us know if it works for you?
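For reference, the suggested workaround looks roughly like the sketch below. The cuML call is shown in the comment (assuming a cuML build that accepts method='fft'); scikit-learn's CPU TSNE is used for the runnable part only to illustrate the same fit_transform interface and the 2-component output, not as a substitute for the GPU implementation.

```python
# cuML equivalent (GPU, requires a RAPIDS environment; assumption: the
# installed cuML version supports method='fft'):
#
#   from cuml.manifold import TSNE
#   tsne = TSNE(n_components=2, method='fft')
#   embedding = tsne.fit_transform(df)   # df: cuDF DataFrame
#
# Runnable CPU illustration of the identical fit_transform workflow:
import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(0).rand(100, 8)  # stand-in for the encoded frame
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
embedding = tsne.fit_transform(X)
print(embedding.shape)
```

Either way, the result is an (n_samples, 2) embedding suitable for a 2-D scatter plot.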
The Fast Fourier Transform method takes exactly 5.43 seconds to run.
I was able to reproduce this issue: FFT works but Barnes-Hut does not. I'll see whether updating the Barnes-Hut kernels resolves it.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
Hello everyone, I am running the cuML TSNE algorithm on the following dataset: https://github.com/oreilly-mlsec/book-resources/tree/master/chapter2/datasets .
The dataset is loaded as a cuDF DataFrame (df = cudf.read_csv("payment_fraud.csv")). The resulting DataFrame has approximately 32,000 rows and 6 columns.
I then one-hot encoded the nominal features with cudf.get_dummies(df, columns=['paymentMethod']), resulting in a DataFrame of 9 columns in total. To run unsupervised learning over the data, I removed the binary (0/1) 'label' column. After that, I ran TSNE's .fit_transform() on the resulting DataFrame, leaving the method arguments at their default settings (I want TSNE to output just 2 components).
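The preprocessing steps described above can be sketched as follows. This is a minimal sketch using pandas as a CPU stand-in (cuDF mirrors this part of the pandas API, so the real workflow would call cudf.read_csv and cudf.get_dummies instead); the toy data is invented, and only 'paymentMethod' and 'label' are column names taken from the thread.

```python
import pandas as pd

# Toy frame standing in for payment_fraud.csv; values are made up for
# illustration.
df = pd.DataFrame({
    "accountAgeDays": [10, 2000, 350, 1],
    "numItems": [1, 3, 2, 1],
    "paymentMethod": ["creditcard", "paypal", "storecredit", "creditcard"],
    "label": [0, 0, 0, 1],
})

# One-hot encode the nominal feature, then drop the label column so the
# remaining frame contains only features for unsupervised learning.
encoded = pd.get_dummies(df, columns=["paymentMethod"])
features = encoded.drop(columns=["label"])
print(features.columns.tolist())
```

The resulting frame contains only numeric columns, which is what TSNE's fit_transform expects.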
I did not standardize the features because I want to maintain their visualization properties.
Although the DataFrame is quite small, TSNE had been running for more than 5 hours when I decided to restart the Jupyter notebook kernel.
I am using the RAPIDS AI stable release on an Ubuntu 18.04 WSL 2 subsystem; my main OS is Windows 11. I installed the software via Conda (Python 3.8, CUDA 11.5). The GPU is an RTX 3060 Laptop GPU, and all NVIDIA drivers are updated to their latest versions.