Open SamGRosen opened 6 months ago
Yes, this is by design. I also found this very confusing initially -- I even opened an issue about it in the UMAP repo (https://github.com/lmcinnes/umap/issues/53) many years ago, but it turns out it's not uncommon to define nearest neighbors this way.
If you need $k$ distinct non-self neighbors, then the best way forward is always to ask for $k + 1$ neighbors. In terms of init
, I don't think it really matters because there is nothing in any of the code that relies on self loops not existing, it's assumed that the self neighbor is just as valid an item in a neighbor list as any other item. For e.g. k = 15
, the fact that you only have 14 "real" neighbors to initialize with versus 15 probably doesn't affect the behavior of the algorithm that much: most candidates are either already seen or aren't moving the search in the right direction.
Using R version
4.2.2
and rnndescent0.1.4
, the returned NN graph contains self-loops. Is this by design, for computational reasons, or a bug? Does this affect performance if the returned graphs are used as theinit
argument in subsequent calls tonnd_knn
? I could always just call the functions with $k+1$ neighbors and remove the first column, but I'm not sure if this will lead to a noticeable difference in speed or recall.Ex:
I see now that this is the behavior of
pynndescent
. However, could you elaborate how to best handle this forinit
arguments, keep the loops or remove them?