Open BERENZ opened 4 months ago
Thank you for reporting this. Can you try running these again with use_alt_metric = FALSE when you build the index, i.e.:
nndes_index <- rnndescent::rnnd_build(data = as.matrix(x_dtm[, colnames_xy]), k = 3, metric = "cosine",
use_alt_metric = FALSE)
You shouldn't have to change the queries in any way.
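As a rough illustration of that workflow, the sketch below builds an index with use_alt_metric = FALSE and then queries it unchanged. The data is random and purely hypothetical (it is not the matrix from this report), and the whole block is guarded so it only runs when rnndescent is installed.

```r
# Hypothetical random data standing in for the real document-term matrix.
set.seed(42)
ref <- matrix(rnorm(100 * 5), nrow = 100)   # 100 reference rows
qry <- matrix(rnorm(5 * 5), nrow = 5)       # 5 query rows

if (requireNamespace("rnndescent", quietly = TRUE)) {
  # Build the index with the plain cosine metric instead of the
  # alternative (transformed) version.
  index <- rnndescent::rnnd_build(
    data = ref, k = 10, metric = "cosine", use_alt_metric = FALSE
  )

  # The query call itself does not need any extra arguments.
  res <- rnndescent::rnnd_query(index, query = qry, k = 3)
  print(res$idx)   # one row per query, one column per neighbour
}
```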
I think the problem here is that the alternative (usually more efficient) version of the cosine metric is causing some problems. Likely it's because the alternative method involves a transformation where large similarities like 1 get mapped to 0, and those may be causing issues downstream with the sparse representation. I'll have to spend a bit of time investigating this and working out a fix.
In the meantime, setting use_alt_metric = FALSE should avoid the problem. As an aside, unless this example was created only for this bug report (in which case thank you for being so thorough!), you might want to use a larger value than k = 3 when building the index. In this case, the resulting search graph is ok.
@BERENZ rnndescent 0.1.6 was released to CRAN and should include fixes for both issues you reported.
Main problem
Setup
Load the packages
I have the following small dataset where register is, say, the true label and query are possible ways to write the label from the register vector, with other missing in the query.
Create sparse matrices with 2 letter shingles using text2vec and tokenizers.
Resulting matrices are presented below
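For readers who want to reproduce the idea without text2vec and tokenizers, a minimal base R sketch of 2-letter shingling could look like this. The strings are hypothetical and are not the data from this report, the helper names shingles and shingle_matrix are made up for illustration, and the result is a dense count matrix rather than a sparse one.

```r
# Hypothetical strings, not the data from this report.
register <- c("apple", "banana", "other")
query <- c("aple", "banan")

# Split a string into overlapping n-character shingles.
shingles <- function(s, n = 2) {
  s <- tolower(s)
  if (nchar(s) < n) return(character(0))
  starts <- seq_len(nchar(s) - n + 1)
  substring(s, starts, starts + n - 1)
}

# Build a dense count matrix (rows = strings, columns = shingles)
# over a vocabulary shared by register and query.
shingle_matrix <- function(strings, vocab, n = 2) {
  m <- t(vapply(strings, function(s) {
    as.numeric(table(factor(shingles(s, n), levels = vocab)))
  }, numeric(length(vocab))))
  dimnames(m) <- list(strings, vocab)
  m
}

vocab <- sort(unique(unlist(lapply(c(register, query), shingles))))
x_reg <- shingle_matrix(register, vocab)
x_qry <- shingle_matrix(query, vocab)
```

Using one shared vocabulary keeps the columns of both matrices aligned, which is what the index build and the queries rely on.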
Further, I create an index with rnnd_build.
and I get
Queries
For k=1 it works as expected
For k=2 it gives different results
And if I change it to k=3, which is exactly the same number of cases as in the register vector, something is wrong
I am reporting this because I do not know how it is related to the size of the dataset.
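One way to check what the correct answer should be on data this small is an exact brute-force search. The base R sketch below is only a reference check, not part of rnndescent; the function name brute_force_cosine and the toy matrices are hypothetical. When k equals the number of register rows, every query row should list each register row exactly once.

```r
# Exact k nearest neighbours under cosine distance (1 - cosine similarity).
# Rows are observations; results are ordered from nearest to farthest.
brute_force_cosine <- function(reference, query, k) {
  stopifnot(k <= nrow(reference))
  unit_rows <- function(m) m / sqrt(rowSums(m^2))  # scale rows to length 1
  dist <- 1 - unit_rows(query) %*% t(unit_rows(reference))
  idx <- matrix(NA_integer_, nrow(query), k)
  dst <- matrix(NA_real_, nrow(query), k)
  for (i in seq_len(nrow(query))) {
    ord <- order(dist[i, ])[seq_len(k)]
    idx[i, ] <- ord
    dst[i, ] <- dist[i, ord]
  }
  list(idx = idx, dist = dst)
}

# Hypothetical toy data: 3 reference rows, 2 query rows.
reference <- rbind(c(1, 0, 0), c(0, 1, 0), c(1, 1, 0))
query <- rbind(c(2, 0, 0), c(1, 2, 0))

res <- brute_force_cosine(reference, query, k = 3)
res$idx
```

Comparing these indices with the approximate results makes it easier to tell whether a discrepancy comes from the search itself or from the data.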
Comparison with RcppHNSW
These problems are not present in the RcppHNSW package
Query with k=1
Query with k=2
Query with k=3
Comparison with RcppAnnoy via uwot
RcppAnnoy works the same way as RcppHNSW (outputs suppressed).
Small dataset with n=2
I imagine that NND algorithms are suited for large data, but if the register contains only two cases the following error occurs while building a graph, while RcppHNSW and RcppAnnoy work fine.
and traceback gives the following information