koheiw / proxyC

R package for large-scale similarity/distance computation
GNU General Public License v3.0
29 stars 6 forks source link

cosine produces values larger than 1 #5

Closed rcannood closed 5 years ago

rcannood commented 5 years ago

Due to rounding errors, the similarity in the current implementation can be larger than 1. This is not really desired, as calculating for instance cos^-1 will then produce NaNs:

> acos(sim)
2 x 2 Matrix of class "dsyMatrix"
         [,1]     [,2]
[1,]      NaN 0.390607
[2,] 0.390607      NaN

Whereas by setting the maximum value to 1, everything will work as intended:

> sim@x[sim@x > 1] <- 1
> acos(sim)
2 x 2 Matrix of class "dsyMatrix"
         [,1]     [,2]
[1,] 0.000000 0.390607
[2,] 0.390607 0.000000

This can easily be tested with the following test:

test_that("cosine similarity can't be larger than 1", {
  x <- Matrix::Matrix(c(1, 2, 5, 3), ncol = 2, sparse = TRUE)
  sim <- simil(x, y = NULL, method = "cosine")
  expect_true(all(sim <= 1))
})
koheiw commented 5 years ago

How about using zapsmall?

> getOption("digits")
[1] 7
> zapsmall(1 + 1e-6)
[1] 1.000001
> zapsmall(1 + 1e-7)
[1] 1
rcannood commented 5 years ago

Cool, I didn't know about zapsmall()! It does the job just fine:

library(proxyC)
x <- Matrix::Matrix(c(1, 2, 5, 3), ncol = 2, sparse = TRUE)
sim <- simil(x, y = NULL, method = "cosine")
acos(sim)
## 2 x 2 Matrix of class "dsyMatrix"
##          [,1]     [,2]
## [1,]      NaN 0.390607
## [2,] 0.390607      NaN
## Warning message:
## In acos(x@x) : NaNs produced
sim@x <- zapsmall(sim@x)
acos(sim)
## 2 x 2 Matrix of class "dsyMatrix"
##          [,1]     [,2]
## [1,] 0.000000 0.390607
## [2,] 0.390607 0.000000
rcannood commented 5 years ago

I don't mind running zapsmall() right before calculating the acos, so you don't necessarily have to include it in proxyC. What do you prefer?

koheiw commented 5 years ago

I think it is better to make cosine similarity to be precisely up to 1. Adding zapsmall() is easy but getOption("digits") is also used in print(). If we apply zapsmall() internally, we have to add a zapsmall argument to proxy() to avoid taking values from the option. This approach is similar to digest::sha1().