jdonaldson / rtsne

An R package for t-SNE (t-Distributed Stochastic Neighbor Embedding)
58 stars 24 forks

Parametric t-SNE #3

Closed mahdeto closed 7 years ago

mahdeto commented 7 years ago

Basically, the ability to embed new points (ones that I did not have at the time of training) into the map without retraining on the entire data set. A potential approach would be to train a multivariate regressor to predict the map location from the input data. Alternatively, you could make such a regressor minimize the t-SNE loss directly, which is what Laurens did in this paper.
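A rough sketch of the first suggestion, using a multivariate linear regression as the regressor (the model choice and the iris-based train/new split are assumptions for illustration; Laurens' paper instead trains a neural network on the t-SNE loss itself):

```r
library(tsne)
set.seed(1)

# Train a map on the "old" data.
train_x   <- as.matrix(iris[1:100, 1:4])
train_emb <- tsne(train_x, k = 2, perplexity = 20, max_iter = 300)

# Fit a multivariate regression from inputs to map coordinates
# (lm supports a matrix response; one column per map dimension).
fit <- lm(train_emb ~ train_x)

# Embed unseen points without retraining the whole map.
new_x   <- as.matrix(iris[101:110, 1:4])
new_emb <- cbind(1, new_x) %*% coef(fit)  # intercept column + coefficients
```

A linear map is a crude approximation of the t-SNE embedding; it only demonstrates the out-of-sample mechanics, not the quality Laurens' parametric approach achieves.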

jdonaldson commented 7 years ago

One way to do this is to use the initial_config parameter. This parameter accepts a pre-calculated embedding as a bootstrap.

The process would involve:

  1. Train an embedding on an initial set of data, and save it.
  2. Add new observations to the training data, and set their locations in the embedding to some initial coordinates (e.g. perhaps the origin, or a median of some sort).
  3. Provide the modified training data to the tsne function, along with the modified embedding (as initial_config).
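The steps above can be sketched as follows (the iris-based old/new split, the median seeding, and the second-pass settings are illustrative assumptions):

```r
library(tsne)
set.seed(42)

# Hypothetical split: 90 "old" rows and 10 "new" rows.
old_x <- as.matrix(iris[1:90, 1:4])
new_x <- as.matrix(iris[91:100, 1:4])

# 1. Train an embedding on the initial set of data, and save it.
old_emb <- tsne(old_x, k = 2, perplexity = 20, max_iter = 300)

# 2. Add new observations and start them at the column-wise median
#    of the existing embedding.
med       <- apply(old_emb, 2, median)
new_start <- matrix(med, nrow = nrow(new_x), ncol = 2, byrow = TRUE)

# 3. Re-run tsne on the combined data, passing the modified embedding
#    as initial_config (min_cost here is an illustrative value).
comb_emb <- tsne(rbind(old_x, new_x),
                 initial_config = rbind(old_emb, new_start),
                 perplexity = 20, min_cost = 0.1, max_iter = 300)
```

Because `initial_config` is supplied, the second call skips the PCA layout and refines the existing map around the seeded points.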

This carries some caveats, so here are my thoughts on those:

  1. If an initial_config is provided, tsne skips the initial embedding phase (a PCA layout). Normally this is what you want, but if you add a large amount of new data, it's probably better to re-do the initial layout from scratch.
  2. It's a good idea to decrease the min_cost parameter on the second pass, to give the new data a fair chance to find an optimal embedding.

Keep in mind you can visualize the progress of the tsne algorithm using the epoch_callback parameter. You could flag the new points, watch them settle, and get a better idea of how to tune the parameters.
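For example, the callback below records the layout at each epoch (it could just as easily plot it, coloring the flagged new points); the `snapshots` name is hypothetical:

```r
library(tsne)
set.seed(7)
x <- as.matrix(iris[sample(nrow(iris), 60), 1:4])

snapshots <- list()
record <- function(y) {
  # Called every `epoch` iterations with the current embedding;
  # e.g. plot(y, col = point_flags) would show new points settling.
  snapshots[[length(snapshots) + 1]] <<- y
}

emb <- tsne(x, epoch_callback = record, epoch = 100, max_iter = 300)
```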