Closed pommedeterresautee closed 6 years ago
@pommedeterresautee I'm sorry for the late reply,
would you mind sharing a reproducible example so that I'm able to find out if it's a bug in the R code that I added to the package. I assume, you mean that the error occurs when you call the init_nms$knn_Query_Batch function.
Thank you for your answer.
I have 100
rows below, and I expect to get 100 * 5
predictions (with k = 5
).
I get 10
predictions and the max row returned is 10
like I have provided a matrix 10X100
and not 100X10
, making me believing that there is a kind of transpose somewhere.
> mat <- Matrix::Matrix(runif(1000, 0, 1), nrow = 100, sparse = TRUE, byrow = TRUE)
> dtm_not_prefix_tfidf_scipy <- dgCMatrix_2scipy_sparse(mat)
>
> dtm_not_prefix_tfidf_scipy$shape
(100, 10)
>
> dtm_not_prefix_tfidf_scipy$getrow(0L)
(0, 0) 0.4460052342619747
(0, 1) 0.267301544547081
(0, 2) 0.9658534412737936
(0, 3) 0.4592860534321517
(0, 4) 0.7499000441748649
(0, 5) 0.29996729688718915
(0, 6) 0.5544748182874173
(0, 7) 0.9539664378389716
(0, 8) 0.8479003391694278
(0, 9) 0.5923537667840719
> mat[1,]
[1] 0.4460052 0.2673015 0.9658534 0.4592861 0.7499000 0.2999673 0.5544748 0.9539664 0.8479003 0.5923538
# The conversion seems ok (dim and content)
>
> init_nms = NMSlib$new(input_data = dtm_not_prefix_tfidf_scipy,
+ Index_Params = list('indexThreadQty' = config$cpu),
+ space = 'cosinesimil_sparse',
+ method = 'hnsw',
+ data_type = 'SPARSE_VECTOR',
+ dtype = 'FLOAT',
+ print_progress = TRUE)
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
>
> b = init_nms$knn_Query_Batch(query_data = dtm_not_prefix_tfidf_scipy,
+ k = 5,
+ num_threads = config$cpu)
>
> dim(b$knn_idx)
[1] 10 5
> max(b$knn_idx)
[1] 10
>
> as.data.frame(b$knn_idx)
V1 V2 V3 V4 V5
1 5 3 8 10 6
2 10 5 7 6 8
3 8 10 1 6 7
4 10 9 7 3 6
5 1 10 7 8 6
6 3 5 10 7 9
7 10 3 5 8 9
8 3 10 5 1 7
9 5 3 7 8 4
10 2 8 7 3 5
I have made several tests with my real data. The R binder nmslibR requires that the sparse matrix is transposed before being converted to scipy. After that nmlibR works as expected. I have no idea why. The very same sparse matrix saved on R (MM format) and reloaded in Python works without requiring any transpose. So there is a kind of bug somewhere in the binder.
@pommedeterresautee I'm sorry for the late reply,
I didn't have access to my computer. I'll start looking at this issue right now.
@pommedeterresautee thanks for making me aware of this issue,
the nmslib python package expects a Compressed Sparse Row matrix (scipy.sparse.csr_matrix). The corresponding one in R is the dgRMatrix-class.
The dgCMatrix_2scipy_sparse() function returns by default a Compressed Sparse Column matrix ( scipy.sparse.csc_matrix), which is not the correct one to use in the nmblib python library.
A detailed example using a scipy sparse matrix in nmslib was added recently by the author.
I will adjust the dgCMatrix_2scipy_sparse() function to return either a sparse column or sparse row matrix (I will notify you once l push the changes on Github).
For now you can receive the expected results using,
set.seed(1)
mat <- Matrix::Matrix(runif(1000, 0, 1), nrow = 100, sparse = TRUE, byrow = TRUE)
class(mat)
# [1] "dgCMatrix"
# attr(,"package")
# [1] "Matrix"
dgr_mat = as(mat, "RsparseMatrix")
class(dgr_mat)
# [1] "dgRMatrix"
# attr(,"package")
# [1] "Matrix"
SCP = reticulate::import("scipy", convert = F)
dtm_not_prefix_tfidf_scipy = SCP$sparse$csr_matrix(reticulate::tuple(dgr_mat@x, dgr_mat@j, dgr_mat@p),
shape = reticulate::tuple(dgr_mat@Dim[1], dgr_mat@Dim[2]))
The conversion to a scipy row sparse matrix follows the same logic as before.
dtm_not_prefix_tfidf_scipy$shape
# (100, 10)
dtm_not_prefix_tfidf_scipy$getrow(0L)
# (0, 0) 0.2655086631421
# (0, 1) 0.37212389963679016
# (0, 2) 0.5728533633518964
# (0, 3) 0.9082077899947762
# (0, 4) 0.2016819310374558
# (0, 5) 0.8983896849676967
# (0, 6) 0.9446752686053514
# (0, 7) 0.6607977924868464
# (0, 8) 0.6291140438988805
# (0, 9) 0.061786270467564464
library(nmslibR)
init_nms = NMSlib$new(input_data = dtm_not_prefix_tfidf_scipy, Index_Params = list('indexThreadQty' = 1),
space = 'cosinesimil_sparse', method = 'hnsw', data_type = 'SPARSE_VECTOR', dtype = 'FLOAT',
print_progress = TRUE)
b = init_nms$knn_Query_Batch(query_data = dtm_not_prefix_tfidf_scipy,
k = 5, num_threads = 1)
dim(b$knn_idx)
# [1] 100 5
max(b$knn_idx)
# [1] 100
df_out = as.data.frame(b$knn_idx)
dim(df_out)
# [1] 100 5
head(df_out)
# V1 V2 V3 V4 V5
# 1 24 89 73 51 82
# 2 19 24 75 60 82
# 3 85 5 77 62 65
# 4 94 93 58 8 9
# 5 84 51 42 18 63
# 6 76 71 48 92 66
please test it and let me know.
Hi, I confirm the code above works correctly on my real data compared to my own brute force search implementation.
@pommedeterresautee I did the following changes,
data(ionosphere, package = 'KernelKnn')
X = as.matrix(ionosphere[, -c(1:2, ncol(ionosphere))])
mt_2_sprm = mat_2scipy_sparse(X, format = 'sparse_row_matrix') # first case : R-matrix to scipy-sparse-row-matrix
mt_2_dgr = as(X, "dgRMatrix")
dgr_2_scsp = TO_scipy_sparse(R_sparse_matrix = mt_2_dgr) # second case : R-sparse-matrix to scipy-sparse-row-matrix
# test that both 'dense knn' (using an R object) and 'sparse knn' (using a python scipy sparse object) return the same output
#----------------------------------------------------------------------------------------------------------------------------
dist_knn = KernelKnn::knn.index.dist(X, TEST_data = NULL, k = 5, method = "euclidean") # the corresponding distance for 'euclidean' in nmslibR is 'l2' or 'l2_sparse (in case of sparse matrices). Page 31 of manual.
# nmslibR with "dense" data and sequential search
#------------------------------------------------
init_nms = NMSlib$new(input_data = X, space = "l2", method = 'seq_search',
data_type = 'DENSE_VECTOR', dtype = 'FLOAT', print_progress = F)
all_dat = init_nms$knn_Query_Batch(X, k = 5, num_threads = 1)
# nmslibR with "sparse" data and sequential search
#-------------------------------------------------
init_nms_spr = NMSlib$new(input_data = dgr_2_scsp, space = "l2_sparse", method = 'seq_search',
data_type = 'SPARSE_VECTOR', dtype = 'FLOAT', print_progress = F)
all_dat_spr = init_nms_spr$knn_Query_Batch(dgr_2_scsp, k = 5, num_threads = 1)
# all three outputs (dist_knn, all_dat, all_dat_spr) must return approximately equal results
#-------------------------------------------------------------------------------------------
# indices [ first 6 rows ]
tmp1 = identical(dist_knn$train_knn_idx[1:6, ], all_dat$knn_idx[1:6, ])
tmp2 = identical(all_dat_spr$knn_idx[1:6, ], all_dat$knn_idx[1:6, ])
tmp3 = identical(dist_knn$train_knn_idx[1:6, ], all_dat_spr$knn_idx[1:6, ])
# distances [ last 6 rows ] -- approximately equal ( use of round() function )
tmp_row1 = identical(round(tail(dist_knn$train_knn_dist), 4), round(tail(all_dat$knn_dist), 4))
tmp_row2 = identical(round(tail(all_dat_spr$knn_idx), 4), round(tail(all_dat$knn_idx), 4))
tmp_row3 = identical(round(tail(dist_knn$train_knn_idx), 4), round(tail(all_dat_spr$knn_idx), 4))
print(all(tmp1, tmp2, tmp3, tmp_row1, tmp_row2, tmp_row3)
The results for the ionoshpere data set agree for dense or sparse data (after conversion from an R sparse matrix to a scipy sparse matrix)
I submitted the new version to CRAN and uploaded the updated on Github.
Feel free to comment / close the issue if the updated version returns the expected output.
I close the issue for now, feel free to reopen it.
Had the same issue - root cause was that the space parameter wasn't cosinesimil_sparse
but cosinesimil
adding input verification that input_data
, space
, and data_type
are compatible could help avoid similar issues
hi @ophiry, in page 31 of the nmslib-manual there are details about each distance metric that can be used either in the python or in this R package. Every metric with a '_sparse' extension is meant to be used with sparse matrices. After my changes based on this issue I don't think there is much more to do. Don't forget that this R package is a port of the python package using reticulate, which means that error cases related to input data types must be addressed in the python code.
I have made several tests with sparse matrix based on the Vignette and
/tests
folder. Results are obviously wrong, even more strange is that it seems that there is a transpose of the matrix (I have 700K rows and 14K columns, and when I performknn_Query_Batch
on the full matrix, the max index returned is 14K.Is there something I need to do to make it work correctly?
If I change the space to
l1
(same code as above) I get the following error at the init part (there is an error per thread):