lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Contents of precomputed_knn tuple #848

Open matthieuheitz opened 2 years ago

matthieuheitz commented 2 years ago

I'm trying to feed UMAP nearest neighbors computed with a method other than pynndescent, so I need to put the data in the format that UMAP accepts for precomputed_knn: the tuple (knn_indices, knn_distances, NNDescentObject). Since I had trouble constructing an NNDescent object without giving it data, I tried passing a fake NNDescent object that satisfies isinstance(knn[2], NNDescent):

import pynndescent
import numpy as np
import umap
import umap.plot
from umap.umap_ import nearest_neighbors
import matplotlib.pyplot as plt

n = 200
N = 3*n
y = np.random.rand(n, 60)
X = np.concatenate((y+20, y, y-20))
synthetic_labels = np.repeat([1, 2, 3], repeats=n)

random_seed = 10

knn = nearest_neighbors(
                        X,
                        n_neighbors=30,
                        metric='euclidean',
                        metric_kwds={},
                        angular=False,
                        random_state=random_seed,
                        )

knn_umap = umap.UMAP(n_neighbors=30, precomputed_knn=knn, random_state=random_seed, force_approximation_algorithm=True)
knn_umap.fit(X)

class MyNNDescent(pynndescent.NNDescent):
    def __init__(self):
        return

knn2 = (knn[0], knn[1], MyNNDescent())
knn_umap2 = umap.UMAP(n_neighbors=30, precomputed_knn=knn2, random_state=random_seed, force_approximation_algorithm=True)
knn_umap2.fit(X)

print("\033[1m"+"Are the embeddings for knn_umap and knn_umap2 the same?\033[0m")
print((knn_umap.embedding_ == knn_umap2.embedding_).all())

fig, ax = plt.subplots(1, 2, figsize=(13,7))
umap.plot.points(knn_umap, labels=synthetic_labels, ax=ax[0], theme='green')
umap.plot.points(knn_umap2, labels=synthetic_labels, ax=ax[1], theme='green')
ax[0].set_title("Precomputed knn 1st run", size=16)
ax[1].set_title("Precomputed knn 2nd run", size=16)
plt.show()

The result is that the two embeddings are identical, so it seems that the NNDescent object is not used at all. Is my assumption correct, or is the object used in some particular cases, which would explain why it was included in the tuple? I tried digging into the code, but I can't find where this object is used...

lmcinnes commented 2 years ago

It gets used if you ever apply a transform to new data where we need to query the nearest neighbors of the new data in the training set (and the NNDescent object is an efficient index/query oracle for this). In an ideal world anything that can function as an appropriate nearest neighbor search oracle trained on the original data would suffice, but the code probably isn't generalised to support such a thing. If you want to make a PR making that possible that would certainly be appreciated.
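To make the role of the stored index concrete, here is a hypothetical illustration (not UMAP's actual code) of what the NNDescent object does at transform() time: given new points, find their k nearest neighbors in the training set. The brute-force implementation below is purely for illustration; NNDescent does this with an approximate index instead.

```python
import numpy as np

def brute_force_query(train, queries, k):
    """Return (indices, distances) of the k nearest training points
    for each query row, mimicking what an NN search index's query()
    is used for at transform() time."""
    # Pairwise Euclidean distances, shape (n_queries, n_train)
    d = np.linalg.norm(queries[:, None, :] - train[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]           # nearest first
    dist = np.take_along_axis(d, idx, axis=1)
    return idx, dist

rng = np.random.default_rng(0)
train = rng.random((100, 5))
new_points = rng.random((3, 5))
idx, dist = brute_force_query(train, new_points, k=10)
print(idx.shape, dist.shape)
```

The key point is that the first two tuple entries (knn_indices, knn_distances) only describe the training set's own neighbor graph; querying neighbors of unseen points requires a live search structure, which is why the third entry exists.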

matthieuheitz commented 2 years ago

Oh I see, so that's the particular case! I'm not sure I'd have time to do a PR for this unfortunately. If anyone is interested, the methods/variables of NNDescent that UMAP uses are:

In UMAP.transform(): query()

In UMAP.update(): update() and neighbor_graph

The new abstract object should probably have a more general interface, with simply query(), update() and neighbor_graph.
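A minimal sketch of what such a general interface could look like. The method names query(), update() and neighbor_graph are the ones proposed above; the brute-force implementation is purely illustrative and not how NNDescent works internally.

```python
import numpy as np

class BruteForceIndex:
    """Toy search index exposing the three members UMAP relies on:
    query(), update(), and the neighbor_graph property."""

    def __init__(self, data, n_neighbors):
        self._data = np.asarray(data, dtype=np.float64)
        self._k = n_neighbors

    @property
    def neighbor_graph(self):
        # (indices, distances) of each training point's k nearest
        # neighbors in the training set, self included at distance 0,
        # matching the first two entries of the precomputed_knn tuple.
        return self.query(self._data, self._k)

    def query(self, queries, k):
        q = np.asarray(queries, dtype=np.float64)
        d = np.linalg.norm(q[:, None, :] - self._data[None, :, :], axis=-1)
        idx = np.argsort(d, axis=1)[:, :k]
        return idx, np.take_along_axis(d, idx, axis=1)

    def update(self, new_data):
        # Grow the index with new rows, as UMAP.update() would need.
        self._data = np.vstack([self._data, np.asarray(new_data)])

rng = np.random.default_rng(1)
index = BruteForceIndex(rng.random((50, 4)), n_neighbors=5)
idx, dist = index.neighbor_graph
print(idx.shape, dist.shape)
```

Any object implementing these three members could then stand in for the NNDescent object in the tuple, if the validation code were relaxed to duck-type instead of using isinstance.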

jlmelville commented 2 years ago

I think an alternative solution would be to move where the failure occurs:

  1. when validating the pre-computed knn, allow the NNDescent object to be optional and if it is not found, warn that the resulting reducer cannot be used to transform new data.
  2. add a check in transform to raise the ValueError there.

This would have the very minor downside of not causing an error until a user tried to transform new data, but allowing the pre-computed knn to be a lot simpler would be very helpful. This is the same strategy that I use in uwot and no-one has complained about it yet (well, not to me anyway).

  3. There is a related wrinkle not mentioned by @matthieuheitz, but given the presence of force_approximation_algorithm=True in the example code I suspect it was triggered: if you use a pre-computed knn with a small (i.e. N < 4096) dataset and don't set that parameter to True, you get a message that a pre-computed knn is only meant for big data and that the supplied data will be ignored. But looking at the code, I don't think UMAP actually follows through on that threat:

https://github.com/lmcinnes/umap/blob/b4e9a318542e797015c2f9c631eab0e0550523c6/umap/umap_.py#L1956-L1961

I am not entirely sure what was intended here: I think it's to ensure that the PyNNDescent code path is followed?
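As I read it, the check in question boils down to something like the sketch below. This is a paraphrase, not the actual umap code: the 4096 threshold comes from the discussion above, the function name is made up, and the warning text is approximate. Note the return value encodes what the warning claims should happen (falling back and ignoring the precomputed knn), which per the linked lines the real code may not actually enforce.

```python
import warnings

def check_precomputed_knn(n_samples, force_approximation_algorithm):
    """Return True if the precomputed knn should be honoured."""
    if n_samples < 4096 and not force_approximation_algorithm:
        warnings.warn(
            "precomputed k-NN is intended for large datasets; for "
            "small data consider metric='precomputed' instead."
        )
        return False  # what the warning claims: ignore the knn
    return True
```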

@lmcinnes if 1-2 is an acceptable solution I am happy to take a crack at a PR for that, and 3 as well if I can understand what it's supposed to do.

lmcinnes commented 2 years ago

I like the 1-2 plan since there are other things that trigger warnings that transforms will not be available, with failure not occurring until later, so it fits the pattern. I would be very happy with a PR if you wanted to try that.

The purpose of force_approximation_algorithm=True is to ensure the pynndescent code path gets run -- not least so that tests can run with small datasets while still exercising that code path. The warning exists (likely? I can't recall) because the pre-computed knns were originally intended for large data, since small datasets could simply use metric="precomputed"; but to be honest there is no reason not to accept a precomputed knn in the small case as well.