Self-loops in NN graph - Githubissues

jlmelville / rnndescent

R package implementing the Nearest Neighbor Descent method for approximate nearest neighbors

GNU General Public License v3.0

10 stars 2 forks source link

Using R version 4.2.2 and rnndescent 0.1.4, the returned NN graph contains self-loops. Is this by design, for computational reasons, or a bug? Does this affect performance if the returned graphs are used as the init argument in subsequent calls to nnd_knn? I could always just call the functions with $k+1$ neighbors and remove the first column, but I'm not sure if this will lead to a noticeable difference in speed or recall.

Ex:

test <- matrix(rnorm(200), nrow=20)
brute <- brute_force_knn(test, 4)
approx <- nnd_knn(test, 4)
brute$idx[, 1] #  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
approx$idx[, 1] #  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

I see now that this is the behavior of pynndescent. However, could you elaborate how to best handle this for init arguments, keep the loops or remove them?

Yes, this is by design. I also found this very confusing initially -- I even opened an issue about it in the UMAP repo (https://github.com/lmcinnes/umap/issues/53) many years ago, but it turns out it's not uncommon to define nearest neighbors this way.

If you need $k$ distinct non-self neighbors, then the best way forward is always to ask for $k + 1$ neighbors. In terms of init, I don't think it really matters because there is nothing in any of the code that relies on self loops not existing, it's assumed that the self neighbor is just as valid an item in a neighbor list as any other item. For e.g. k = 15, the fact that you only have 14 "real" neighbors to initialize with versus 15 probably doesn't affect the behavior of the algorithm that much: most candidates are either already seen or aren't moving the search in the right direction.

jlmelville / rnndescent

Self-loops in NN graph #12