Correlation produces incorrect results when the standard deviation is zero

koheiw / proxyC

R package for large-scale similarity/distance computation

GNU General Public License v3.0

29 stars 6 forks source link

Correlation produces incorrect results when the standard deviation is zero #20

Closed rcannood closed 3 years ago

rcannood commented 3 years ago

Sorry, I don't have a solution for this issue yet. When I compute the correlation on vectors with sd == 0, the output correlation is 0 or +inf instead of NA or NaN.

Example:

mat1 <- Matrix::Matrix(1:4, nrow = 1, sparse = TRUE)
mat2 <- Matrix::Matrix(rep(0.1111, 4), nrow = 1, sparse = TRUE)
proxyC::simil(mat1, mat2, method = "correlation")
# [1,] Inf
proxy::simil(as.matrix(mat1), as.matrix(mat2), method = "correlation")
# [1,] NaN
cor(t(as.matrix(mat1)), t(as.matrix(mat2)), method = "pearson")
# [1,] NA

mat1 <- Matrix::Matrix(1:5, nrow = 1, sparse = TRUE)
mat2 <- Matrix::Matrix(rep(0.1111, 5), nrow = 1, sparse = TRUE)
proxyC::simil(mat1, mat2, method = "correlation")
# [1,] .
proxy::simil(as.matrix(mat1), as.matrix(mat2), method = "correlation")
# [1,] NaN
cor(t(as.matrix(mat1)), t(as.matrix(mat2)), method = "pearson")
# [1,] NA

Needless to say, this can lead to some very strange behaviour in downstream results ;)

I created a branch, issue_20, which breaks a test to address this issue.

koheiw commented 3 years ago

Thank for this. I intended to return zero if a vector has no variance as in mat1 and mat3 but mat1 and mat2 is an unexpected behavior that I need to fix.

> mat1 <- Matrix::Matrix(1:4, nrow = 1, sparse = TRUE)
> mat2 <- Matrix::Matrix(rep(0.1111, 4), nrow = 1, sparse = TRUE)
> mat3 <- Matrix::Matrix(rep(1, 4), nrow = 1, sparse = TRUE)
> proxyC::simil(mat1, mat2, method = "correlation")
1 x 1 sparse Matrix of class "dgTMatrix"

[1,] Inf
> proxyC::simil(mat1, mat3, method = "correlation")
1 x 1 sparse Matrix of class "dgTMatrix"

[1,] .

Returning NA is more difficult because it makes the matrix dense. I wanted users to deal with vectors without variance using rowSds() or colSds() manually, but happy to change if there is a better way.

> proxyC::rowSds(mat1)
[1] 1.290994
> proxyC::rowSds(mat2)
[1] 0

rcannood commented 3 years ago

Returning zeros instead of NAs makes sense in this context.

Would it make sense to throw a one-time warning if one of the standard deviations is zero?

rcannood commented 3 years ago

I committed a proposed solution for this issue. Feel free to reject it and provide a different solution altogether.