rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] T-SNE support to embed out-of-sample points #3097

Open cjnolet opened 3 years ago

cjnolet commented 3 years ago

UMAP's ability to embed out-of-sample data points has allowed us to create a fairly simple Dask wrapper for distributed inference: we embed a subset of the training data on a single GPU, then broadcast the trained model to the Dask workers, which perform inference in parallel. We have found that, as long as the training subset is a representative sample of the neighborhoods in the larger population, the distributed embedding shows even higher overall trustworthiness on some datasets.

If we were able to support out-of-sample embeddings in T-SNE, we should be able to support distributed inference in a similar manner to UMAP. However, past research has shown that this feature is not as straightforward to implement. It would still be worth investigating its feasibility for cuML's T-SNE, because the alternative, distributing the training itself, would require distributing the entire optimization stage. That is possible, but it would carry a high communication overhead.
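To make the intended workflow concrete, here is a minimal, CPU-only sketch of the "train once, broadcast, transform partitions in parallel" pattern. Everything in it is an assumption for illustration: `FrozenEmbedding` and `distributed_transform` are hypothetical names, not cuML or Dask APIs; threads stand in for Dask workers; and the out-of-sample rule (place each new point at the inverse-distance-weighted mean of the embeddings of its k nearest training points) is a simple stand-in heuristic, not UMAP's transform and not a proposed T-SNE implementation.

```python
import math
from concurrent.futures import ThreadPoolExecutor


class FrozenEmbedding:
    """Hypothetical stand-in for a trained, read-only embedding model."""

    def __init__(self, train_points, train_embedding, k=3):
        self.train_points = train_points        # points in the input space
        self.train_embedding = train_embedding  # their low-dimensional coordinates
        self.k = k

    def transform(self, points):
        # Each new point is embedded independently of the others,
        # which is what makes the per-partition work parallelizable.
        return [self._embed_one(p) for p in points]

    def _embed_one(self, point):
        # Find the k nearest training points in the input space.
        nearest = sorted(
            (math.dist(point, t), i) for i, t in enumerate(self.train_points)
        )[: self.k]
        # Inverse-distance weights; the epsilon avoids division by zero.
        weights = [1.0 / (d + 1e-12) for d, _ in nearest]
        total = sum(weights)
        dim = len(self.train_embedding[0])
        # Weighted mean of the neighbors' embedding coordinates.
        return tuple(
            sum(w * self.train_embedding[i][c] for w, (_, i) in zip(weights, nearest))
            / total
            for c in range(dim)
        )


def distributed_transform(model, partitions, max_workers=4):
    """Apply the same frozen model to every partition in parallel.

    Threads play the role of workers here; results keep partition order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(model.transform, partitions))


# Toy data: four training points in 3-D with made-up 2-D embeddings.
train_points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10)]
train_embedding = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]

model = FrozenEmbedding(train_points, train_embedding, k=3)
partitions = [[(0, 0, 0)], [(10, 10, 10), (1, 0, 0)]]
embedded = distributed_transform(model, partitions)
```

The key property is that `transform` never modifies the model, so no communication between workers is needed after the broadcast. That is the contrast with distributing the optimization stage itself.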

Van der Maaten created Parametric t-SNE, which should be able to accomplish this, but it requires the use of deep neural networks: https://lvdmaaten.github.io/publications/papers/AISTATS_2009.pdf. Also relevant is this issue on Van der Maaten's Barnes-Hut GitHub repository.

Additional literature:

zbjornson commented 3 years ago

I think #861 can be merged into this issue.

github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.