BERENZ commented 4 months ago

Main problem

Setup

Load the packages

library(RcppHNSW)
library(rnndescent)
library(tokenizers)
library(text2vec)

I have the following small dataset where register is, say, the true label and query are possible ways to write the label from the register vector with other missing in the query.

register <- c("montypython", "kowalskijan", "other")

query <- c(
  "jankowalski",
  "kowalskijan",
  "kowalskimjan",
  "kowaljan",
  "montypython",
  "pythonmonty",
  "cyrkmontypython",
  "monty"
)

Create sparse matrices with 2 letter shingles using text2vec and tokenizers.

x_tokens <- text2vec::itoken_parallel(
  iterable = register,
  tokenizer = function(x) tokenizers::tokenize_character_shingles(x, n = 2),
  n_chunks = 10, progressbar = verbose)
x_voc <- text2vec::create_vocabulary(x_tokens)
x_vec <- text2vec::vocab_vectorizer(x_voc)
x_dtm <- text2vec::create_dtm(x_tokens, x_vec)

y_tokens <- text2vec::itoken_parallel(
  iterable = query,
  tokenizer = function(x) tokenizers::tokenize_character_shingles(x, n = 2),
  n_chunks = 10, progressbar = verbose)
y_voc <- text2vec::create_vocabulary(y_tokens)
y_vec <- text2vec::vocab_vectorizer(y_voc)
y_dtm <- text2vec::create_dtm(y_tokens, y_vec)

colnames_xy <- intersect(colnames(x_dtm), colnames(y_dtm))

Resulting matrices are presented below

> x_dtm
3 x 22 sparse Matrix of class "dgCMatrix"
  [[ suppressing 22 column names 'al', 'an', 'er' ... ]]

1 . . . . 1 . . . . . 1 1 . . 1 . 1 . 1 1 2 1
2 1 1 . . . 1 1 1 1 1 . . . 1 . 1 . 1 . . . .
3 . . 1 1 . . . . . . . . 1 . . . . . . . . 1

> y_dtm
8 x 28 sparse Matrix of class "dgCMatrix"
  [[ suppressing 28 column names 'cy', 'ij', 'im' ... ]]

1 . . . . . . 1 . . . . . 1 1 . 1 . . 1 1 1 1 . . 1 . 1 .
2 . 1 . . . . . . . . . . 1 1 . 1 . . 1 1 1 1 . . 1 . 1 .
3 . . 1 . . 1 . . . . . . 1 1 . 1 . . 1 1 1 1 . . 1 . 1 .
4 . . . . 1 . . . . . . . . . . . . . 1 1 1 1 . . 1 . 1 .
5 . . . . . . . . . . 1 1 . . 1 . 1 1 . . . . 1 1 . 1 . 2
6 . . . . . . . 1 . . . 1 . . 1 . 1 1 . . . . 1 1 . 1 . 2
7 1 . . 1 . . . . 1 1 1 1 . . 1 . 1 1 . . . . 1 1 . 1 . 2
8 . . . . . . . . . . . . . . . . . . . . . . 1 1 . 1 . 1

Further, I create an index with rnnd_build.

set.seed(2024)
nndes_index <- rnndescent::rnnd_build(data = as.matrix(x_dtm[, colnames_xy]), k = 3, metric = "cosine")

and I get

> nndes_index$graph
$idx
     [,1] [,2] [,3]
[1,]    1    3    2
[2,]    2    3    1
[3,]    3    1    2

$dist
     [,1]      [,2] [,3]
[1,]    0 0.7113249    1
[2,]    0 1.0000000    1
[3,]    0 0.7113249    1

Queries

For k=1 it works as expected

> nndes_result <- rnndescent::rnnd_query(nndes_index, query = as.matrix(y_dtm[, colnames_xy]), k = 1)
> nndes_result
$idx
     [,1]
[1,]    2
[2,]    2
[3,]    2
[4,]    2
[5,]    1
[6,]    1
[7,]    1
[8,]    1

$dist
           [,1]
[1,] 0.05131668
[2,] 0.00000000
[3,] 0.05131668
[4,] 0.22540338
[5,] 0.00000000
[6,] 0.04257286
[7,] 0.00000000
[8,] 0.27831215

For k=2 it gives different results

> nndes_result <- rnndescent::rnnd_query(nndes_index, query = as.matrix(y_dtm[, colnames_xy]), k = 2)
> nndes_result
$idx
     [,1] [,2]
[1,]    1    2
[2,]    2    1
[3,]    1    2
[4,]    1    2
[5,]    1    3
[6,]    1    3
[7,]    1    3
[8,]    2    1

$dist
           [,1]       [,2]
[1,] 0.00000000 0.05131668
[2,] 0.00000000 0.00000000
[3,] 0.00000000 0.05131668
[4,] 0.00000000 0.22540338
[5,] 0.00000000 0.71132485
[6,] 0.04257286 0.69848866
[7,] 0.00000000 0.71132485
[8,] 0.00000000 0.27831215

And if I change it to k=3 which is exactly the same number of cases as in the register vector something is wrong

> nndes_result <- rnndescent::rnnd_query(nndes_index, query = as.matrix(y_dtm[, colnames_xy]), k = 3)
> nndes_result
$idx
     [,1] [,2] [,3]
[1,]    3    1    2
[2,]    2    1    3
[3,]    3    1    2
[4,]    3    1    2
[5,]    1    2    3
[6,]    2    1    3
[7,]    1    2    3
[8,]    3    2    1

$dist
     [,1]       [,2]       [,3]
[1,]    0 0.00000000 0.05131668
[2,]    0 0.00000000 0.00000000
[3,]    0 0.00000000 0.05131668
[4,]    0 0.00000000 0.22540338
[5,]    0 0.00000000 0.71132485
[6,]    0 0.04257286 0.69848866
[7,]    0 0.00000000 0.71132485
[8,]    0 0.00000000 0.27831215

I am reporting this because I do not know how it is related to the size of the dataset.

Comparison with `RcppHNSW`

These problems are not present in the RcppHNSW package

hnsw_index <- RcppHNSW::hnsw_build(X = as.matrix(x_dtm[, colnames_xy]), distance = "cosine")

Query with k=1

> hnsw_result <- RcppHNSW::hnsw_search(X = as.matrix(y_dtm[, colnames_xy]), ann = hnsw_index,  k = 1)
> hnsw_result
$idx
     [,1]
[1,]    2
[2,]    2
[3,]    2
[4,]    2
[5,]    1
[6,]    1
[7,]    1
[8,]    1

$dist
              [,1]
[1,]  5.131668e-02
[2,] -1.192093e-07
[3,]  5.131668e-02
[4,]  2.254034e-01
[5,]  1.192093e-07
[6,]  4.257292e-02
[7,]  1.192093e-07
[8,]  2.783122e-01

Query with k=2

> hnsw_result <- RcppHNSW::hnsw_search(X = as.matrix(y_dtm[, colnames_xy]), ann = hnsw_index,  k = 2)
> hnsw_result
$idx
     [,1] [,2]
[1,]    2    3
[2,]    2    3
[3,]    2    3
[4,]    2    3
[5,]    1    3
[6,]    1    3
[7,]    1    3
[8,]    1    3

$dist
              [,1]      [,2]
[1,]  5.131668e-02 1.0000000
[2,] -1.192093e-07 1.0000000
[3,]  5.131668e-02 1.0000000
[4,]  2.254034e-01 1.0000000
[5,]  1.192093e-07 0.7113249
[6,]  4.257292e-02 0.6984887
[7,]  1.192093e-07 0.7113249
[8,]  2.783122e-01 1.0000000

Query with k=3

> hnsw_result <- RcppHNSW::hnsw_search(X = as.matrix(y_dtm[, colnames_xy]), ann = hnsw_index,  k = 3)
> hnsw_result
$idx
     [,1] [,2] [,3]
[1,]    2    1    3
[2,]    2    1    3
[3,]    2    1    3
[4,]    2    1    3
[5,]    1    3    2
[6,]    1    3    2
[7,]    1    3    2
[8,]    1    2    3

$dist
              [,1]      [,2] [,3]
[1,]  5.131668e-02 1.0000000    1
[2,] -1.192093e-07 1.0000000    1
[3,]  5.131668e-02 1.0000000    1
[4,]  2.254034e-01 1.0000000    1
[5,]  1.192093e-07 0.7113249    1
[6,]  4.257292e-02 0.6984887    1
[7,]  1.192093e-07 0.7113249    1
[8,]  2.783122e-01 1.0000000    1

Comparison with `RcppAnnoy` via `uwot`

RcppAnnoy works the same way as RcppHNSW (outputs suppressed).

annoy_index <- uwot:::annoy_build(X = as.matrix(x_dtm[, colnames_xy]), metric = "cosine")
uwot:::annoy_search(X = as.matrix(y_dtm[, colnames_xy]), k = 1, ann = annoy_index)
uwot:::annoy_search(X = as.matrix(y_dtm[, colnames_xy]), k = 2, ann = annoy_index)
uwot:::annoy_search(X = as.matrix(y_dtm[, colnames_xy]), k = 3, ann = annoy_index)

Small dataset with `n=2`

I imagine that NND algorithms are suited for large data but if the register contain only two cases the following error occurs while building a graph while RcppHNSW and RcppAnnoy works file.

set.seed(2024)
nndes_index <- rnndescent::rnnd_build(data = as.matrix(x_dtm[1:2, colnames_xy]), k = 2, metric = "cosine")

Error in (function (cl, name, valueClass)  : 
  assignment of an object of class "list" is not valid for @'x' in an object of class "dgCMatrix"; is(value, "numeric") is not TRUE

and traceback gives the following information

> traceback()
4: stop(gettextf("assignment of an object of class %s is not valid for @%s in an object of class %s; is(value, \"%s\") is not TRUE", 
       dQuote(valueClass), sQuote(name), dQuote(cl), slotClass), 
       domain = NA)
3: (function (cl, name, valueClass) 
   {
       ClassDef <- getClass(cl)
       slotClass <- ClassDef@slots[[name]]
       if (is.null(slotClass)) 
           stop(gettextf("%s is not a slot in class %s", sQuote(name), 
               dQuote(cl)), domain = NA)
       if (.identC(slotClass, valueClass)) 
           return(TRUE)
       ok <- possibleExtends(valueClass, slotClass, ClassDef2 = getClassDef(slotClass, 
           where = .classEnv(ClassDef)))
       if (isFALSE(ok)) 
           stop(gettextf("assignment of an object of class %s is not valid for @%s in an object of class %s; is(value, \"%s\") is not TRUE", 
               dQuote(valueClass), sQuote(name), dQuote(cl), slotClass), 
               domain = NA)
       TRUE
   })(structure("dgCMatrix", package = "Matrix"), "x", "list")
2: prepare_search_graph(data = index$data, graph = index$graph, 
       metric = index$original_metric, use_alt_metric = index$use_alt_metric, 
       diversify_prob = index$prep$diversify_prob, pruning_degree_multiplier = index$prep$pruning_degree_multiplier, 
       prune_reverse = index$prep$prune_reverse, n_threads = n_threads, 
       verbose = verbose, obs = "C")
1: rnndescent::rnnd_build(data = as.matrix(x_dtm[1:2, colnames_xy]), 
       k = 2, metric = "cosine")

jlmelville commented 4 months ago

Thank you for reporting this. Can you try running these again with use_alt_metric = FALSE when you being the index, i.e.:

nndes_index <- rnndescent::rnnd_build(data = as.matrix(x_dtm[, colnames_xy]), k = 3, metric = "cosine", 
                                      use_alt_metric = FALSE)

You shouldn't have to change the queries in any way.

I think the problem here is that the alternative (usually more efficient) version of the cosine metric is causing some problems. Likely it's because the alternative method involves a transformation where large similarities like 1 get mapped to 0, and those may be causing issues downstream with the sparse representation. I'll have to spend a bit of time investigating this and working out a fix.

In the mean time, setting use_alt_metric = FALSE should avoid the problem. As an aside, unless this example was created only for this bug report (in which case thank you for being so thorough!) you might want to use a larger value than k = 3 when building the index. In this case, the resulting search graph is ok.

jlmelville commented 4 months ago

@BERENZ rnndescent 0.1.6 was released to CRAN and should include fixes for both issues you reported.

jlmelville / rnndescent

Query result on small index dataset #14

Main problem

Setup

Queries

Comparison with `RcppHNSW`

Comparison with `RcppAnnoy` via `uwot`

Small dataset with `n=2`

jlmelville / rnndescent

Query result on small index dataset #14

Main problem

Setup

Queries

Comparison with RcppHNSW

Comparison with RcppAnnoy via uwot

Small dataset with n=2

Comparison with `RcppHNSW`

Comparison with `RcppAnnoy` via `uwot`

Small dataset with `n=2`