lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Can the `Johnson–Lindenstrauss lemma` be a replacement for `Spectral embedding`? #901

Closed igaloly closed 1 year ago

igaloly commented 2 years ago

Given the guarantees of the JL lemma, the increased speed of a random projection, and the fact that an objective of the spectral embedding is to make the reduction deterministic, might it be worth using the JL lemma instead of spectral embedding?
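One thing worth noting about the JL lemma: its distance-preservation guarantee only kicks in at a target dimension on the order of `log(n) / eps²`, which is far above the 2 dimensions used for initialization. A quick way to see this is sklearn's `johnson_lindenstrauss_min_dim` helper (a sketch; the sample count is an arbitrary illustration):

```python
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# Minimum target dimension for which the JL lemma guarantees that
# pairwise distances among 10,000 points are preserved within eps = 0.1.
n_dims = johnson_lindenstrauss_min_dim(n_samples=10_000, eps=0.1)
print(n_dims)  # thousands of dimensions, nowhere near 2
```

So a 2-D random projection gives you the speed of the method but none of the lemma's guarantees.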

jlmelville commented 2 years ago

The purpose of the spectral initialization is a bit different from a random projection. The spectral initialization has the advantage of:

Some possible downsides of using a random projection:

At any rate, it's easy to try this with sklearn:

    import sklearn.random_projection

    # assume your data is in the variable x
    embedder = sklearn.random_projection.SparseRandomProjection(
        n_components=2,
        random_state=42,  # set for reproducibility
    )
    init_coords = embedder.fit_transform(x)

You can then run UMAP with `init=init_coords` to use the random projection output as the initialization. In my experience, it didn't give results that seemed particularly great. Unless speed really was an issue, I would prefer truncated SVD for this kind of initialization: it gave results I preferred the look of, although I did no quantitative evaluation.
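For comparison, the truncated SVD alternative mentioned above might look like this (a sketch; `x` is a synthetic stand-in for your data, and the UMAP call is shown only in a comment):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)
x = rng.normal(size=(500, 50))  # stand-in for your data

# Deterministic-ish 2-D initialization via truncated SVD
svd = TruncatedSVD(n_components=2, random_state=42)
init_coords = svd.fit_transform(x)

# Then, e.g.:
# import umap
# embedding = umap.UMAP(init=init_coords).fit_transform(x)
```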

One way you could use random projections would be if you had very high dimensional data and wanted to reduce the dimensionality initially before applying UMAP. Shah and Silwal looked at this (in the context of t-SNE), but found that PCA did a better job, so truncated SVD would again be preferred for that. Oh well.
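The pre-reduction idea above can be sketched as follows (dimensions are arbitrary illustrations; both reducers are from sklearn):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
x_high = rng.normal(size=(200, 5000))  # very high-dimensional data

# Option A: random projection down to an intermediate dimension,
# fast but with no data-dependent structure
rp = GaussianRandomProjection(n_components=100, random_state=0)
x_rp = rp.fit_transform(x_high)

# Option B: PCA (or truncated SVD), which Shah and Silwal found
# did a better job in the t-SNE setting
pca = PCA(n_components=100, random_state=0)
x_pca = pca.fit_transform(x_high)

# Either reduced matrix could then be passed to UMAP in place of x_high.
```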