jlmelville / uwot

An R package implementing the UMAP dimensionality reduction method.
https://jlmelville.github.io/uwot/
GNU General Public License v3.0
321 stars 31 forks source link

Set random seed for umap_transform() #10

Closed ummel closed 5 years ago

ummel commented 6 years ago

There is a discussion over at the python UMAP repo where it is mentioned that the python equivalent of uwot::umap_transform includes an option to set a random number seed internally:

The transform function should now be consistent in the transformation (via a fixed transform seed which you can pick on instantiation if you wish).

The current behavior of umap_transform is, for example:

iris_umap <- umap(iris, ret_model = TRUE)
iris_umap2 <- umap_transform(iris, model = iris_umap)

Which will yield something like:

head(iris_umap$embedding) [,1] [,2] [1,] 4.200771 8.554187 [2,] 3.545428 6.722146 [3,] 3.172449 7.146079 [4,] 3.298905 7.069255 [5,] 4.084401 8.499889 [6,] 5.120863 9.330461

head(iris_umap2) [,1] [,2] [1,] 4.069650 8.825538 [2,] 3.723976 6.704192 [3,] 3.110911 7.511822 [4,] 3.286992 6.673110 [5,] 3.893319 8.772491 [6,] 5.126285 9.693930

The results are close but not exactly the same, due to the stochastic nature of the UMAP algorithm.

However, a relevant (and, I think, potentially common) use case is when umap() is used to reduce the dimensionality of a set of X predictor variables and a model is then trained on the embedding. If new observations (X2) become available for which we wish to make predictions, those values need to be deterministically translated into the embedding space (using the same random number generation as the original UMAP calculations). Even relatively small differences in how the "natural" X values are translated to the embedded space could cause identical X and X2 observations to generate different predictions from subsequent models.

How difficult would it be to enable a random seed to be set (and returned) for umap and then later passed to umap_transform?

jlmelville commented 6 years ago

It's definitely reasonable to want UMAP to be deterministic given a specific seed. You don't mention set.seed, so just to be clear, would it be sufficient for umap() and umap_transform() to be repeatable in the following way:

set.seed(123)
iris_umap <- umap(iris, ret_model = TRUE)
set.seed(123)
iris_umap2 <- umap(iris, ret_model = TRUE)
# iris_umap$embedding and iris_umap2$embedding should be the same (but aren't currently)

and:

iris_umap <- umap(iris, ret_model = TRUE)
set.seed(123)
iris_umap2 <- umap_transform(iris, model = iris_umap)
set.seed(123)
iris_umap3 <- umap_transform(iris, model = iris_umap)
# iris_umap2 and iris_umap3 should be the same (but also won't be)

The set.seed call could be moved internally into umap and umap_transform if needed. Presumably there is a use case where the ability to call set.seed manually isn't available?

Anyway, this is all hypothetical because unfortunately, even when calling set.seed, results are currently not repeatable. The challenge is to generate lots of random numbers quickly, and to do so in multiple threads. The current convoluted implementation seems to have introduced a dependency on the order that threads are executed (I think). I would like to fix this.

Finally, even if umap becomes deterministic for a given random seed, I should mention that I think the example code you give will never result in iris_umap$embedding and iris_umap2 being the same. iris_umap is initialized by a spectral decomposition of the normalized Laplacian of the input graph, whereas iris_umap2 is initialized based on the coordinates in iris_umap$embedding. Even if the initialization was the same, the coordinates are optimized for a different number of epochs in the two methods and neither get close to convergence.

jlmelville commented 6 years ago

Now I think about it, the problem isn't necessarily with the random number generation, but that we are at the mercy of the thread scheduler and are writing to data that is accessible from every thread. The order that threads do work has an effect on the different workers because each thread works on their own non-overlapping chunk of edges, but the vertices that make up those edges could appear in any of the chunks and their positions are therefore updated by different threads. Additionally, negative sampling is carried out over the entire dataset, so that's another way the position of a vertex can pop up at different times in different threads.

The way to fix this would be to accumulate the updates during each epoch, but not change the actual position of any vertex. At the end of each epoch, the update would be applied to the positions. This would have two downsides: first, it doubles the memory storage requirements of the output coordinates (although not UMAP as a whole); second, it probably slows convergence. The upside is that this method would allow experimentation with momentum methods to speed convergence (probably at the cost of yet more memory, though).

I'll think about whether there's a way to implement this without having to maintain two separate but very similar methods, reducing performance due to extra branch checks, or a nightmarish maze of template specializations.

jlmelville commented 6 years ago

Note to myself about implementation: to completely avoid the risk of interleaved read-and-updates of the same coordinate between different threads, you would have to make a copy of the data structure storing the edges, and accumulate updates on a per-edge basis. At the end of each epoch, you would then "reduce" this to a vertex-wise coordinate update. Default UMAP settings typically result in at least an order of magnitude more edges than vertexes, so that's a much larger memory storage requirement (and a further pain to implement). The alternative is to do some research into alternatives to locking the entire update data structure (which otherwise would happen a lot).

I don't have immediate plans to fix this. I have noted the problem in the README.

jlmelville commented 5 years ago

A discussion over at UMAP https://github.com/lmcinnes/umap/issues/158 seems related.

jlmelville commented 5 years ago

Due to work in #15, the code like this:

iris_umap <- umap(iris, ret_model = TRUE)
set.seed(123)
iris_umap2 <- umap_transform(iris, model = iris_umap)
set.seed(123)
iris_umap3 <- umap_transform(iris, model = iris_umap)

is now repeatable, if you install from trunk (or use release 0.0.0.9008 when it comes out).

I don't think a transform_seed is needed in R, because set.seed is the usual way to do this. Please re-open if this isn't an adequate solution.