rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] Why does tSNE get stuck depending on the distribution of the data? #3865

Open Cliff-Lin opened 3 years ago

Cliff-Lin commented 3 years ago

Describe the bug I have some feature sets whose dimension is 128. With more than 1000 tSNE iterations, most sets finish in a few minutes. However, other sets are stuck for more than 12 hours. Since I don't know how long it will take, I terminate it before it finishes. Is there any option to obtain the number of iterations it has run? What condition causes it to get stuck? For my case (nearly 500K samples), the proper number of iterations is actually 10000 to get a better result.

Steps/Code to reproduce bug Just call TSNE(n_iter=10000).fit_transform(x)
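
For completeness, a self-contained version of that call (a sketch only; the random array stands in for the actual feature sets, which are roughly 500K samples with 128 dimensions):

```python
import numpy as np
from cuml.manifold import TSNE

# Placeholder data with the same shape as the real feature sets (~500K x 128).
x = np.random.rand(500_000, 128).astype(np.float32)

embedding = TSNE(n_iter=10000).fit_transform(x)
```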

Expected behavior All sets should finish within a roughly equal time budget.

Environment details (please complete the following information):

Cliff-Lin commented 3 years ago

I've tried the fft mode. I can see the running iteration, but it runs 1000x or 10000x slower than the default setting on all sets. That seems weird, right?

mdemoret-nv commented 3 years ago

Thanks for your bug report! I can answer a few of your questions but will need the help of a couple of my colleagues to diagnose the root cause of the problem.

First, can you give us more information to help us reproduce the issue on our end? You mentioned that some feature sets work and others get stuck. Do you have examples of both of these feature sets that we could try?

Regarding your other questions:

Is there any option to obtain the number of iterations it has run?

As far as I know, this is currently not possible. If you are terminating the TSNE.fit() call with Ctrl+C, this will kill the currently executing statement that is performing the iterations. New functionality would need to be added to store or write out intermediate iterations that could be viewed after a KeyboardInterrupt is raised.

In the meantime, I would suggest you try:

  1. Enable debug logging.
    1. With debug logging enabled, additional information about the current state of the algorithm is printed to the log every 100 iterations.
  2. Incrementally increase the number of iterations.
    1. By using a fixed seed and iteratively increasing the number of iterations (i.e. n_iter=100, 1000, 2000, ... 10000, etc.) you should be able to get the output at intermediate states. A rough sketch of both suggestions is shown below.
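
Something along these lines (a minimal sketch; the file path is hypothetical, and it assumes verbose accepts cuML's log-level constants from cuml.common.logger in this release):

```python
import numpy as np
from cuml.common import logger   # assumed location of the logger in this release
from cuml.manifold import TSNE

x = np.load("features.npy")      # hypothetical path to one of the feature sets

# Debug-level logging prints progress (including the current iteration)
# roughly every 100 iterations. A fixed random_state keeps the initialization
# identical across runs, so stepping n_iter up shows intermediate states.
for n_iter in (100, 1000, 2000, 5000, 10000):
    embedding = TSNE(n_iter=n_iter, random_state=42,
                     verbose=logger.level_debug).fit_transform(x)
```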

@cjnolet and @divyegala Do you have any other suggestions or an idea of what could be causing the TSNE algorithm to take longer for certain feature sets?

Cliff-Lin commented 3 years ago

The links below are the sample sets:

short.npy can be processed in a few minutes under the default settings of tSNE, while long.npy cannot. Both files have the same number of feature vectors and the same dimensionality. I hope these files are helpful for you. I will follow your suggestions to debug it.
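
A minimal sketch of the comparison, assuming both files have been downloaded to the working directory:

```python
import time
import numpy as np
from cuml.manifold import TSNE

# Assumes short.npy and long.npy are available in the working directory.
for name in ("short.npy", "long.npy"):
    x = np.load(name)
    start = time.time()
    TSNE().fit_transform(x)                        # default (Barnes-Hut) settings
    print(f"{name}: {time.time() - start:.1f} s")  # long.npy may not return
```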

lowener commented 2 years ago

I've tried the fft mode. I can see the running iteration, but it runs 1000x or 10000x slower than the default setting on all sets. That seems weird, right?

I could not reproduce the slowdown that you experienced with the FFT method. Running the short dataset takes me 55 seconds, the long dataset 65 seconds, and a synthetic dataset from make_blobs or make_classification of the same shape as your dataset takes me 60 seconds, all with the FFT method and default parameters. So the FFT method runs at a consistent speed on my side.
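
For reference, a rough sketch of the synthetic comparison (the 500,000 x 128 shape is an assumption based on the description of the original feature sets):

```python
import time
from cuml.datasets import make_blobs
from cuml.manifold import TSNE

# Synthetic data with (assumed) the same shape as the reported feature sets.
x, _ = make_blobs(n_samples=500_000, n_features=128, random_state=0)

start = time.time()
TSNE(method="fft").fit_transform(x)
print(f"FFT method, synthetic blobs: {time.time() - start:.1f} s")
```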

And to add some information on the Barnes-Hut method: it is hanging on my side too. After adding logging information I saw that the short dataset was blocked around iteration 678 and the long dataset around iteration 526.

I'm using the current dev version (21.12) with Ubuntu 20.04, CUDA 11.5 and driver 470 on a Quadro RTX 8000.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.