lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Difference between the high-dimensional graph adjacency weights and the Euclidean distances on low-dimensional embeddings #934

Open jules-samaran opened 2 years ago

jules-samaran commented 2 years ago

Dear authors,

My question is the following: the low-dimensional embeddings are optimized so that their weighted graph representation is as close as possible to the "high-dimensional graph" (closeness being measured by the cross entropy), and the weights of the "low-dimensional graph" are a non-increasing function of the Euclidean distances between embeddings. Why, then, do I not obtain the same neighborhoods when I rank the weights of the high-dimensional graph in descending order and rank the Euclidean distances between embeddings in ascending order?

Here is some code linked to the question:

import numpy as np
from umap import UMAP
from sklearn.neighbors import NearestNeighbors

# Random data with no particular low-dimensional structure
data = np.random.rand(400, 45)
reducer = UMAP()
embedding = reducer.fit_transform(data)

# High-dimensional fuzzy graph as a dense array
graph = np.asarray(reducer.graph_.todense())

# 50 nearest neighbors of each point in the low-dimensional embedding
nbrs = NearestNeighbors(n_neighbors=50, algorithm='ball_tree', metric="euclidean").fit(embedding)
_, indices = nbrs.kneighbors(embedding)

# 50 largest graph weights of each point, in descending order
graph_indices = np.argsort(graph, axis=1)[:, ::-1][:, :50]
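
For concreteness, here is a simplified sketch of how I then compare the two neighborhood sets (using the variables defined above):

# For each point, count how many of its top-50 graph neighbors also appear
# among its 50 nearest neighbors in the embedding.
overlap = np.array([
    len(set(graph_indices[i]) & set(indices[i]))
    for i in range(data.shape[0])
])
print(overlap.mean() / 50)  # average fraction of shared neighbors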

Is there a reason why the graph and the embeddings would not give the same neighborhoods?

jlmelville commented 2 years ago

You can squeeze a lot more neighbors around an item in a high dimension than in a low dimension. There's just no room in the low dimension to exactly preserve the neighborhoods of 45-dimensional random data.

In general, even for good results where you see sensible clustering, only 10-20% of the k-nearest neighbors in the high-dimensional space may also appear among the k-nearest neighbors in the low-dimensional embedding.

Also be aware that UMAP defaults to n_neighbors=15, so take that into account when setting your expectations for how many of the 50 nearest neighbors are retained.
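
Related to that: graph_ only stores weights for roughly n_neighbors neighbors per point (a bit more after the fuzzy set union), so when you rank 50 columns of it, most of them are ties at zero whose order is arbitrary. As a rough sketch, using the reducer from the snippet above, you can check this directly:

# Number of nonzero graph weights per row; with the default n_neighbors=15
# this is typically far fewer than the 50 neighbors being requested above.
nonzero_per_row = np.diff(reducer.graph_.tocsr().indptr)
print(nonzero_per_row.min(), nonzero_per_row.mean(), nonzero_per_row.max())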

Finally, for well-known datasets like the MNIST digits, the intrinsic dimensionality has been estimated to be substantially lower than 45 (more like 8), so your example may be a harder task than typical real-world datasets.
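
If you want to see the contrast, a rough sketch of an easier case: data with low intrinsic dimensionality embedded in 45 ambient dimensions (the variable names here are just for illustration). Rerunning the same neighborhood comparison on this should give noticeably higher overlap than pure 45-dimensional noise.

# 8 intrinsic dimensions linearly embedded into 45 ambient dimensions
rng = np.random.default_rng(0)
latent = rng.random((400, 8))
projection = rng.normal(size=(8, 45))
data_low_id = latent @ projection
embedding_low_id = UMAP().fit_transform(data_low_id)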