Open matthieuheitz opened 2 years ago
It gets used if you ever apply a transform to new data where we need to query the nearest neighbors of the new data in the training set (and the NNDescent object is an efficient index/query oracle for this). In an ideal world anything that can function as an appropriate nearest neighbor search oracle trained on the original data would suffice, but the code probably isn't generalised to support such a thing. If you want to make a PR making that possible that would certainly be appreciated.
Oh I see, so that's the particular case! I'm not sure I'd have time to do a PR for this unfortunately. If anyone is interested, the methods/variables of NNDescent that are used by UMAP are:
In UMAP.transform():
query()
_angular_trees
(not important in the case I have the exact neighbors I'm guessing)In UMAP.update():
prepare()
update()
_raw_data
neighbor_graph
The new abstract object should probably have a more general interface, with simply query()
, update()
and neighbor_graph
.
I think an alternative solution would be to move where the failure occurs:
NNDescent
object to be optional and if it is not found, warn that the resulting reducer cannot be used to transform new data.transform
to raise
the ValueError
there.This would have the very minor downside of not causing an error until a user tried to transform new data, but allowing the pre-computed knn to be a lot simpler would be very helpful. This is the same strategy that I use in uwot and no-one has complained about it yet (well, not to me anyway).
force_approximation_algorithm=True
in the example code I suspect it was triggered: if you use a pre-computed knn with a small (i.e. N < 4096) dataset and don't set that parameter to True
, you get a message that a pre-computed knn is only meant for big data and the supplied data will be ignored. But looking at the code, I don't think UMAP actually follows through on that threat:I am not entirely sure what was intended here: I think it's to ensure that the PyNNDescent code path is followed?
@lmcinnes if 1-2 is an acceptable solution I am happy to take a crack at a PR for that, and 3 as well if I can understand what it's supposed to do.
I like the 1-2 plan since there are other things that trigger warnings that transforms will not eb available and failure doesn't occur 'til later, so it fits the pattern. I would be very happy with a PR if you wanted to try that.
The plan for force_approximation_algorithm=True
is to ensure the pynndescent code path gets run -- not least so tests can run with small datasets but exercise that code path. The warning is (likely? I can't recall) because the pre-computed knns were originally for large data since small datasets could simply use metric="precomputed"
, but to be honest there is no reason not to accept the precomputed knn even in the small case.
I'm trying to feed to UMAP nearest neighbors computed with another method than pynndescent, and so I need to put the data in the format of precomputed_knn that UMAP accept: the tuple: (knn_indices, knn_distances, NNDescentObject) Since I had trouble making an NNDescent object without giving it data, I tried giving it a fake NNDescent object, that returns true to isinstance(knn[2], NNDescent):
The result is that the 2 embeddings are the same, so it seems that the NNDescent object is not used at all. Is my assumption correct, or is the object used in some particular cases, which would explain why it was included in the tuple? I tried digging into the code, but I can't find the place where this object is used...