Closed: mjskay closed this 3 years ago
I don't mind adding a variance(<numeric>) method, and possibly also removing the default (or better, raising a helpful error). This deviation from {stats} for matrices should be mentioned in the documentation.
I'm happy for you to work on a PR for this.
I'm preparing a release for this now and wanted to get your opinion on the appropriateness of this change in relation to multivariate distributions. The current behaviour is:
library(distributional)
ux <- c(1, 2)
mx <- matrix(c(0,3,1,4), nrow = 2)
uv <- dist_normal(0,1)
mv <- dist_multivariate_normal(mu = list(c(1,2)), sigma = list(matrix(c(4,2,2,3), ncol=2)))
dimnames(mv) <- c("a", "b")
mean(ux)
#> [1] 1.5
mean(mx)
#> [1] 2
mean(uv)
#> [1] 0
mean(mv)
#>      a b
#> [1,] 1 2
variance(ux)
#> [1] 0.5
variance(mx)
#> [1] 3.333333
variance(uv)
#> [1] 1
variance(mv)
#> [[1]]
#>      [,1] [,2]
#> [1,]    4    2
#> [2,]    2    3
Created on 2021-10-04 by the reprex package (v2.0.0)
I'm now considering changing the multivariate distribution's variance() output to give the diagonal instead of the variance-covariance matrix, so that:
variance(mv)
#>      a b
#> [1,] 4 3
A new generic, covariance(), would be added to give the current variance(mv) behaviour.
Does this sound reasonable? The other question is what covariance(<matrix>) should give. I think it is reasonable to expect a covariance matrix, but in that case shouldn't variance() give the diagonal of that covariance matrix (the per-column variances) rather than the variance of all columns combined?
Yeah, I agree. I think for a multivariate normal I would expect variance() to give the variance of the marginal distributions, i.e. the diagonal of the covariance matrix.
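To illustrate the point about marginal variances, here is a minimal base-R sketch (it does not use {distributional}) with the covariance matrix from the example above. The variable names are only for illustration.

```r
# Covariance matrix of the multivariate normal used in the thread,
# with the dimension names "a" and "b" attached for readability
sigma <- matrix(c(4, 2, 2, 3), ncol = 2,
                dimnames = list(c("a", "b"), c("a", "b")))

# The marginal variances are the diagonal entries of the covariance matrix
marginal_var <- diag(sigma)
marginal_var
#> a b 
#> 4 3
```

These are the same values as the proposed variance(mv) output, just returned as a named vector rather than a one-row matrix.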
Thanks :)
Here's what I've got so far for variance() and covariance(). Do you see any problems with this?
library(distributional)
ux <- c(1, 2)
mx <- matrix(c(0,3,1,4), nrow = 2)
uv <- dist_normal(0,1)
mv <- dist_multivariate_normal(mu = list(c(1,2)), sigma = list(matrix(c(4,2,2,3), ncol=2)))
dimnames(mv) <- c("a", "b")
mean(ux)
#> [1] 1.5
mean(mx)
#> [1] 2
mean(uv)
#> [1] 0
mean(mv)
#>      a b
#> [1,] 1 2
variance(ux)
#> [1] 0.5
variance(mx)
#> [1] 3.333333
variance(uv)
#> [1] 1
variance(mv)
#>      a b
#> [1,] 4 3
covariance(ux)
#> Error in stats::cov(x, ...): supply both 'x' and 'y' or a matrix-like 'x'
covariance(mx)
#>      [,1] [,2]
#> [1,]  4.5  4.5
#> [2,]  4.5  4.5
covariance(uv)
#> [1] 1
covariance(mv)
#> [[1]]
#>      [,1] [,2]
#> [1,]    4    2
#> [2,]    2    3
Created on 2021-10-11 by the reprex package (v2.0.0)
Looks good to me!
The variance() generic inherits a misfeature of base-R var() in that for matrices it returns a covariance matrix. This is a misfeature in my opinion as it means that (1) var() does not parallel sd(); (2) var() returns a different type of output depending on what it guesses the caller's intent is rather than just providing a consistent API; and (3) cov() returns the covariance matrix anyway so there is no need for var() to do so. Because variance() delegates to var() in the default case, it also has this behavior. For example:
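A minimal base-R sketch of the behaviour being described, using the same small matrix as the reprex elsewhere in this thread. It only calls stats functions, so it shows what var() does rather than what any particular version of distributional's variance() returns.

```r
mx <- matrix(c(0, 3, 1, 4), nrow = 2)

# var() on a matrix returns the covariance matrix of the columns ...
var(mx)
#>      [,1] [,2]
#> [1,]  4.5  4.5
#> [2,]  4.5  4.5

# ... which is the same thing cov() already returns
all.equal(var(mx), cov(mx))
#> [1] TRUE

# sd() on a matrix pools all elements and returns a single number
sd(mx)
#> [1] 1.825742

# Dropping the dimensions first makes var() behave like sd() does
var(as.vector(mx))
#> [1] 3.333333
```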
I would expect variance() on a matrix to behave much like sd() does; i.e. return a single value the same as it does on a vector.
We ran into this problem in {posterior} as we would like people to be able to run summary functions over posterior samples that are stored in matrices (see https://github.com/stan-dev/posterior/issues/121). Since obviously we can't fix base var(), and we already have a dependency on {distributional}, we were hoping to be able to use distributional's variance() for this purpose.
I think the fix should be straightforward, either by changing this function definition:
https://github.com/mitchelloharawild/distributional/blob/ab2fd9e3f62a716a49590d6143283ba830c77c3b/R/distribution.R#L220-L223
to something like this:
Or by adding a function definition like this:
Are either of those changes something you'd be willing to have in distributional? If so I'd be happy to submit a PR for whichever solution you prefer.
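As a rough illustration of the two options described above (this is not the code from the issue or from distributional), the sketch below uses a hypothetical stand-in generic, `my_variance()`, so that it runs with base R alone. Both method bodies are assumptions about what such a change could look like.

```r
# Hypothetical stand-in for the package generic (assumption, not distributional's code)
my_variance <- function(x, ...) UseMethod("my_variance")

# Option 1 (assumed shape): the default method drops dimensions before calling var()
my_variance.default <- function(x, ...) stats::var(as.vector(x), ...)

# Option 2 (assumed shape): a dedicated matrix method that pools all elements
my_variance.matrix <- function(x, ...) stats::var(as.vector(x), ...)

ux <- c(1, 2)
mx <- matrix(c(0, 3, 1, 4), nrow = 2)

my_variance(ux)  # dispatches to the default method
#> [1] 0.5
my_variance(mx)  # dispatches to the matrix method
#> [1] 3.333333
```

Either option on its own would be enough to give a single value for a matrix; they are shown together here only to keep the example short.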