koheiw / proxyC

R package for large-scale similarity/distance computation
GNU General Public License v3.0

Cosine does not return value even when min_simil or rank is used #22

Closed: koheiw closed this issue 2 years ago

koheiw commented 3 years ago
> mat <- Matrix::Matrix(matrix(c(0, 0, 0, 2, 3, 4), byrow = TRUE, nrow = 2), sparse = TRUE)
> proxyC::simil(mat, method = "cosine")
2 x 2 sparse Matrix of class "dsTMatrix"

[1,] . .
[2,] . 1
> proxyC::simil(mat, method = "cosine", min_simil = -1, drop0 = FALSE)
2 x 2 sparse Matrix of class "dsTMatrix"

[1,] . .
[2,] . 1
> proxyC::simil(mat, method = "cosine", rank = 100, drop0 = FALSE)
2 x 2 sparse Matrix of class "dgTMatrix"

[1,] . .
[2,] . 1
> proxyC::simil(mat, method = "cosine", rank = 100, drop0 = FALSE, use_nan = TRUE)
2 x 2 sparse Matrix of class "dgTMatrix"

[1,] . .
[2,] . 1

This happens because simils[k] is -nan, which comes from the all-zero row vector: its norm is zero, so the cosine works out to 0/0.

https://github.com/koheiw/proxyC/blob/96770543908922f93c431cd8d64df483d85fb74d/src/linear.cpp#L90
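
To illustrate why the NaN never shows up in the output, here is a minimal C++ sketch. It is not the code from linear.cpp: `cosine()` and `keep()` are hypothetical helpers, and the `s >= min_simil` filter is only an assumption about how a threshold might be applied. The point is that any comparison against NaN is false, so even `min_simil = -1` cannot retain the value.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not proxyC's implementation): plain cosine similarity.
double cosine(const std::vector<double>& x, const std::vector<double>& y) {
    double dot = 0.0, xx = 0.0, yy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        dot += x[i] * y[i];
        xx += x[i] * x[i];
        yy += y[i] * y[i];
    }
    // If either vector is all zeros, this evaluates 0.0 / 0.0, which is NaN
    // under IEEE 754 arithmetic.
    return dot / (std::sqrt(xx) * std::sqrt(yy));
}

// Hypothetical threshold filter: every ordered comparison with NaN is false,
// so a NaN similarity is rejected whatever min_simil is.
bool keep(double s, double min_simil) {
    return s >= min_simil;
}
```

With the matrix from the report, `cosine({0.0, 0.0, 0.0}, {2.0, 3.0, 4.0})` is NaN and `keep(..., -1.0)` is false, which would match the empty first row and column in the output above.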

It should return 0 or NaN, but I am not sure whether cosine([0, 0, 0], [0, 0, 0]) = 0. @rcannood what are your thoughts?

I also feel that drop0 is misleading because our sparse outputs do not contain zeros. It could be given a different name.

koheiw commented 3 years ago

A more general solution is https://github.com/koheiw/proxyC/commit/be58e72af5c55bc333c4f011550058645a62af87. We can apply this to any simil/dist measure (if we want to).

> proxyC::simil(mat, method = "cosine", use_nan = TRUE)
2 x 2 sparse Matrix of class "dsTMatrix"

[1,] NaN NaN
[2,] NaN   1
> proxyC::simil(mat, method = "correlation", use_nan = TRUE)
2 x 2 sparse Matrix of class "dsTMatrix"

[1,] NaN NaN
[2,] NaN   1
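
The behaviour visible in these outputs could be sketched in C++ roughly as follows. This is an assumption based only on the printed results, not on the contents of the linked commit: `finalize()` is a hypothetical name, and the real change may be structured quite differently. The idea is a single post-processing step that either keeps an undefined similarity as NaN or replaces it with 0, which is why it could apply to any measure.

```cpp
#include <cmath>

// Hypothetical post-processing step (name and signature are illustrative only).
// s       : a similarity or distance value that may be NaN
// use_nan : if true, keep NaN so undefined pairs stay visible in the output;
//           if false, replace NaN with 0 so the pair is dropped from a sparse result
double finalize(double s, bool use_nan) {
    if (std::isnan(s)) {
        return use_nan ? s : 0.0;
    }
    return s; // defined values pass through unchanged
}
```

Because the step only inspects the final value, it does not need to know whether the NaN came from cosine, correlation, or another measure.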