It's definitely reasonable to want UMAP to be deterministic given a specific seed. You don't mention `set.seed`, so just to be clear, would it be sufficient for `umap()` and `umap_transform()` to be repeatable in the following way:
```r
set.seed(123)
iris_umap <- umap(iris, ret_model = TRUE)

set.seed(123)
iris_umap2 <- umap(iris, ret_model = TRUE)

# iris_umap$embedding and iris_umap2$embedding should be the same (but aren't currently)
```
and:
```r
iris_umap <- umap(iris, ret_model = TRUE)

set.seed(123)
iris_umap2 <- umap_transform(iris, model = iris_umap)

set.seed(123)
iris_umap3 <- umap_transform(iris, model = iris_umap)

# iris_umap2 and iris_umap3 should be the same (but also won't be)
```
The `set.seed` call could be moved internally into `umap` and `umap_transform` if needed. Presumably there is a use case where the ability to call `set.seed` manually isn't available?
Anyway, this is all hypothetical because unfortunately, even when calling `set.seed`, results are currently not repeatable. The challenge is to generate lots of random numbers quickly, and to do so in multiple threads. The current convoluted implementation seems to have introduced a dependency on the order in which threads are executed (I think). I would like to fix this.
Finally, even if `umap` becomes deterministic for a given random seed, I should mention that I think the example code you give will never result in `iris_umap$embedding` and `iris_umap2` being the same. `iris_umap` is initialized by a spectral decomposition of the normalized Laplacian of the input graph, whereas `iris_umap2` is initialized from the coordinates in `iris_umap$embedding`. Even if the initialization were the same, the coordinates are optimized for a different number of epochs in the two methods, and neither gets close to convergence.
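For what it's worth, the initialization difference (though not the epoch difference) could be removed by starting a fresh run from the model's coordinates; a sketch, relying on `init` accepting a matrix of initial coordinates:

```r
set.seed(123)
iris_umap <- umap(iris, ret_model = TRUE)

# start a fresh optimization from the model's final coordinates
# instead of the default spectral initialization
set.seed(123)
iris_umap_reinit <- umap(iris, init = iris_umap$embedding)
```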
Now I think about it, the problem isn't necessarily with the random number generation, but that we are at the mercy of the thread scheduler and are writing to data that is accessible from every thread. The order in which threads do work affects the different workers because each thread works on its own non-overlapping chunk of edges, but the vertices that make up those edges can appear in any of the chunks, so their positions are updated by different threads. Additionally, negative sampling is carried out over the entire dataset, which is another way the position of a vertex can pop up at different times in different threads.
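Here is a single-threaded caricature of the order dependence (toy update rule, purely for illustration): two edges, as if owned by two different threads, both involve vertex 1, and each update reads whatever the current positions happen to be, so the schedule changes the result.

```r
pos <- c(1, 2, 3)  # toy 1-d positions for 3 vertices

# move the head vertex of an edge part way towards the tail vertex,
# reading the *current* (possibly already-updated) positions
update <- function(pos, edge, alpha = 0.5) {
  i <- edge[1]
  j <- edge[2]
  pos[i] <- pos[i] + alpha * (pos[j] - pos[i])
  pos
}

edge_a <- c(1, 2)  # processed by "thread" A
edge_b <- c(1, 3)  # processed by "thread" B

# schedule 1: A's edge first, then B's
update(update(pos, edge_a), edge_b)  # 2.25 2.00 3.00
# schedule 2: the reverse order
update(update(pos, edge_b), edge_a)  # 2.00 2.00 3.00
```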
The way to fix this would be to accumulate the updates during each epoch, but not change the actual position of any vertex; at the end of each epoch, the accumulated update would be applied to the positions. This would have two downsides: first, it doubles the memory requirements for the output coordinates (although not for UMAP as a whole); second, it probably slows convergence. The upside is that this method would allow experimentation with momentum methods to speed convergence (probably at the cost of yet more memory).
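A sketch of that idea (hypothetical names, not the actual uwot internals): gradients accumulate into a separate delta matrix while the positions stay frozen for the whole epoch, so the per-edge processing order no longer matters, and the delta is applied once at the epoch boundary.

```r
# One epoch of "synchronous" updates: positions are read-only while
# per-vertex deltas accumulate, then everything is applied in one step.
epoch_update <- function(pos, edges, grad_fn) {
  delta <- matrix(0, nrow = nrow(pos), ncol = ncol(pos))
  for (e in seq_len(nrow(edges))) {
    i <- edges[e, 1]
    j <- edges[e, 2]
    # gradient computed from the frozen start-of-epoch positions only
    delta[i, ] <- delta[i, ] + grad_fn(pos[i, ], pos[j, ])
  }
  pos + delta  # apply the accumulated update at the epoch boundary
}

# toy usage: pull each edge's head towards its tail
pos <- matrix(c(0, 1, 2, 0, 1, 2), ncol = 2)
edges <- cbind(c(1, 2), c(2, 3))
epoch_update(pos, edges, function(pi, pj) 0.5 * (pj - pi))
```

Splitting `edges` across threads, each with its own `delta` that is reduced in a fixed order at the end of the epoch, would then give the same answer no matter how the scheduler interleaves the work.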
I'll think about whether there's a way to implement this without maintaining two separate but very similar methods, losing performance to extra branch checks, or creating a nightmarish maze of template specializations.
Note to myself about implementation: to completely avoid the risk of interleaved reads and updates of the same coordinate between different threads, you would have to make a copy of the data structure storing the edges and accumulate updates on a per-edge basis. At the end of each epoch, you would then "reduce" this to a vertex-wise coordinate update. Default UMAP settings typically result in at least an order of magnitude more edges than vertices, so that's a much larger memory requirement (and a further pain to implement). The alternative is to research ways to avoid locking the entire update data structure (which would otherwise happen a lot).
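A sketch of that per-edge accumulation and reduce step (stand-in update values; the actual uwot data structures will differ):

```r
# Four edges among four vertices; column "head" is the vertex an
# edge's update applies to.
edges <- cbind(head = c(1, 2, 1, 3), tail = c(2, 3, 3, 4))

# Each edge gets its own row of update storage: no two edges share
# memory, so threads could fill these rows in any order.
set.seed(42)
per_edge <- matrix(rnorm(nrow(edges) * 2), ncol = 2)  # stand-in updates

# End of epoch: "reduce" per-edge updates to vertex-wise deltas.
# rowsum() sums the rows grouped by head vertex in a fixed order,
# independent of which thread produced which row.
delta <- rowsum(per_edge, group = edges[, "head"])
```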
I don't have immediate plans to fix this. I have noted the problem in the README.
A discussion over at the Python UMAP repo, https://github.com/lmcinnes/umap/issues/158, seems related.
Due to work in #15, code like this:
```r
iris_umap <- umap(iris, ret_model = TRUE)

set.seed(123)
iris_umap2 <- umap_transform(iris, model = iris_umap)

set.seed(123)
iris_umap3 <- umap_transform(iris, model = iris_umap)
```
is now repeatable, if you install from trunk (or use release 0.0.0.9008 when it comes out).
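A quick check (assuming a build that includes the fix):

```r
# both transforms were run with the same seed, so the coordinates match
all.equal(iris_umap2, iris_umap3)
#> [1] TRUE
```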
I don't think a `transform_seed` argument is needed in R, because `set.seed` is the usual way to do this. Please re-open if this isn't an adequate solution.
There is a discussion over at the python UMAP repo where it is mentioned that the python equivalent of `uwot::umap_transform` includes an option (`transform_seed`) to set a random number seed internally. The current behavior of `umap_transform` is, for example:
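(A minimal sketch of the behavior in question, using the iris example that appears elsewhere in this thread.)

```r
library(uwot)

# embed the data, keeping the model so the same data can be re-projected
iris_umap <- umap(iris, ret_model = TRUE)

# transform the same data through the fitted model
iris_umap2 <- umap_transform(iris, model = iris_umap)

# compare iris_umap$embedding with iris_umap2
head(iris_umap$embedding)
head(iris_umap2)
```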
The results are close but not exactly the same, due to the stochastic nature of the UMAP algorithm.
However, a relevant (and, I think, potentially common) use case is when `umap()` is used to reduce the dimensionality of a set of `X` predictor variables and a model is then trained on the embedding. If new observations (`X2`) become available for which we wish to make predictions, those values need to be deterministically translated into the embedding space (using the same random number generation as the original UMAP calculations). Even relatively small differences in how the "natural" `X` values are translated to the embedded space could cause identical `X` and `X2` observations to generate different predictions from subsequent models. A sketch of this workflow is below.

How difficult would it be to enable a random seed to be set (and returned) for `umap` and then later passed to `umap_transform`?
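For concreteness, here is a minimal sketch of the workflow described above (the iris split, the linear model, and the column names are placeholders):

```r
library(uwot)

X  <- iris[1:100, 1:3]    # original predictors
X2 <- iris[101:150, 1:3]  # new observations arriving later

set.seed(123)
fit <- umap(X, ret_model = TRUE)

# train a downstream model on the embedded coordinates
emb <- fit$embedding
colnames(emb) <- c("U1", "U2")
model <- lm(y ~ ., data = data.frame(emb, y = iris$Petal.Width[1:100]))

# X2 must be mapped into the same embedding space deterministically,
# or identical rows in X and X2 can yield different predictions
set.seed(123)
emb2 <- umap_transform(X2, model = fit)
colnames(emb2) <- c("U1", "U2")
predict(model, newdata = as.data.frame(emb2))
```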