Fix base-inherited behavior of variance() on matrices

mjskay commented 3 years ago

The variance() generic inherits a misfeature of base-R var() in that for matrices it returns a covariance matrix. This is a misfeature in my opinion as it means that (1) var() does not parallel sd(); (2) var() returns a different type of output depending on what it guesses the caller's intent is rather than just providing a consistent API; and (3) cov() returns the covariance matrix anyway so there is no need for var() to do so. Because variance() delegates to var() in the default case, it also has this behavior.

For example:

library(distributional)

set.seed(123456)
x = rnorm(16)
sd(x)
## [1] 1.060834
sd(matrix(x, nrow = 4))
## [1] 1.060834
var(x)
## [1] 1.125369
var(matrix(x, nrow = 4))
##            [,1]       [,2]       [,3]       [,4]
## [1,]  0.2946480  0.2999319  0.4823488 -0.3004184
## [2,]  0.2999319  0.6153391  0.1802705 -0.4886662
## [3,]  0.4823488  0.1802705  1.1038920 -0.3054725
## [4,] -0.3004184 -0.4886662 -0.3054725  0.4174298
variance(x)
## [1] 1.125369
variance(matrix(x, nrow = 4))
##            [,1]       [,2]       [,3]       [,4]
## [1,]  0.2946480  0.2999319  0.4823488 -0.3004184
## [2,]  0.2999319  0.6153391  0.1802705 -0.4886662
## [3,]  0.4823488  0.1802705  1.1038920 -0.3054725
## [4,] -0.3004184 -0.4886662 -0.3054725  0.4174298

I would expect variance() on a matrix to behave much like sd() does; i.e. return a single value the same as it does on a vector.

We ran into this problem in {posterior} as we would like people to be able to run summary functions over posterior samples that are stored in matrices (see https://github.com/stan-dev/posterior/issues/121). Since obviously we can't fix base var(), and we already have a dependency on {distributional}, we were hoping to be able to use distributional's variance() for this purpose.

I think the fix should be straightforward, either by changing this function definition:

https://github.com/mitchelloharawild/distributional/blob/ab2fd9e3f62a716a49590d6143283ba830c77c3b/R/distribution.R#L220-L223

to something like this:

#' @export
variance.default <- function(x, ...){
  stats::var(as.vector(x), ...)
}

Or by adding a function definition like this:

#' @export
variance.numeric <- function(x, ...){
  stats::var(as.vector(x), ...)
}

Are either of those changes something you'd be willing to have in distributional? If so I'd be happy to submit a PR for whichever solution you prefer.

mitchelloharawild commented 3 years ago

I don't mind adding a variance(<numeric>) method, and possibly also removing the default (or better, raising a helpful error). This deviation from {stats} for matrices should be mentioned in the documentation.

I'm happy for you to work on a PR for this.

mitchelloharawild commented 2 years ago

I'm preparing a release for this now and wanted to get your opinion on the appropriateness of this change in relation to multivariate distributions. The current behaviour is:

library(distributional)

ux <- c(1, 2)
mx <- matrix(c(0,3,1,4), nrow = 2)
uv <- dist_normal(0,1)
mv <- dist_multivariate_normal(mu = list(c(1,2)), sigma = list(matrix(c(4,2,2,3), ncol=2)))
dimnames(mv) <- c("a", "b")

mean(ux)
#> [1] 1.5
mean(mx)
#> [1] 2
mean(uv)
#> [1] 0
mean(mv)
#>      a b
#> [1,] 1 2

variance(ux)
#> [1] 0.5
variance(mx)
#> [1] 3.333333
variance(uv)
#> [1] 1
variance(mv)
#> [[1]]
#>      [,1] [,2]
#> [1,]    4    2
#> [2,]    2    3

^{Created on 2021-10-04 by the reprex package (v2.0.0)}

I'm now considering changing the multivariate distribution's variance() output to give the diagonal instead of the variance-covariance matrix, so that:

variance(mv)
#>      a b
#> [1,] 4 3

A new generic, covariance() would be added to give the current variance(mv) behaviour.

Does this sound reasonable? The other question would be what covariance(<matrix>) should give? I think it is reasonable to expect a covariance matrix, but in that case shouldn't variance() give the diagonal of the matrix rather than the variance of all columns combined?

mjskay commented 2 years ago

Yeah, I agree --- I think for a multivariate normal I would expect variance() to give the variance of the marginal distributions, i.e. the diagonal of the covariance matrix.

mitchelloharawild commented 2 years ago

Thanks :)

Here's what I've got so far for variance() and covariance(), do you see any problems with this?

library(distributional)

ux <- c(1, 2)
mx <- matrix(c(0,3,1,4), nrow = 2)
uv <- dist_normal(0,1)
mv <- dist_multivariate_normal(mu = list(c(1,2)), sigma = list(matrix(c(4,2,2,3), ncol=2)))
dimnames(mv) <- c("a", "b")

mean(ux)
#> [1] 1.5
mean(mx)
#> [1] 2
mean(uv)
#> [1] 0
mean(mv)
#>      a b
#> [1,] 1 2

variance(ux)
#> [1] 0.5
variance(mx)
#> [1] 3.333333
variance(uv)
#> [1] 1
variance(mv)
#>      a b
#> [1,] 4 3

covariance(ux)
#> Error in stats::cov(x, ...): supply both 'x' and 'y' or a matrix-like 'x'
covariance(mx)
#>      [,1] [,2]
#> [1,]  4.5  4.5
#> [2,]  4.5  4.5
covariance(uv)
#> [1] 1
covariance(mv)
#> [[1]]
#>      [,1] [,2]
#> [1,]    4    2
#> [2,]    2    3

^{Created on 2021-10-11 by the reprex package (v2.0.0)}

mjskay commented 2 years ago

Looks good to me!

mitchelloharawild / distributional

Fix base-inherited behavior of variance() on matrices #55