mapme-initiative / mapme.biodiversity

Efficient analysis of spatial biodiversity datasets for global portfolios
https://mapme-initiative.github.io/mapme.biodiversity/dev
GNU General Public License v3.0

Variance of Batch processing results #327

Open karpfen opened 4 weeks ago

karpfen commented 4 weeks ago

Currently, the summary statistics for indicator values in batch processing are combined by nesting them. For variance, this leads to unexpected results.

Normally, you would estimate the variance of the full sample as:

$Var(X) = \frac{1} {(N - 1)} \sum(x_i - \overline x)^2$

which we can decompose into

$Var(X) = \frac{\sum(n_i - 1) s_i^2 + \sum n_i (m_i - M)^2} {N - 1}$

where $n_i$, $m_i$, and $s_i^2$ are the size, mean, and variance of chunk $i$, $N = \sum n_i$ is the total sample size, and $M = \frac{\sum n_i m_i}{N}$ is the pooled mean.

This works in general, but I have not yet looked into how practical it would be to implement in the package; I just wanted to open this issue as an FYI for now. The decomposition can be checked against `var()` on the full sample:

x <- runif(100)
chunk_size <- 3
x_chunk <- split(x, ceiling(seq_along(x) / chunk_size))

# per-chunk size, variance, and mean
n_i <- sapply(x_chunk, length)
s_i2 <- sapply(x_chunk, var) # NA for chunks of size 1
m_i <- sapply(x_chunk, mean)

# total sample size and pooled mean
N <- sum(n_i)
M <- sum(n_i * m_i) / N

# within-chunk plus between-chunk sums of squares
chunked_var <- (sum((n_i - 1) * s_i2, na.rm = TRUE) + sum(n_i * (m_i - M)^2, na.rm = TRUE)) / (N - 1)

testthat::expect_equal(
  var(x),
  chunked_var
)
goergen95 commented 4 weeks ago

Thank you for looking into this! I suppose the right place to hook up a function like this to aggregate chunks is here

https://github.com/mapme-initiative/mapme.biodiversity/blob/f92256e727d92ea816e558d8b9cb890c41e3c85f/R/chunking.R#L130-L144
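For the aggregation hook, something along these lines might work: a minimal sketch of a helper that pools per-chunk summary statistics into a single variance via the decomposition above. The function name `combine_chunk_var` and its signature are hypothetical, not part of the package.

```r
# Hypothetical helper: pool per-chunk statistics into an overall variance.
# n_i: chunk sizes, m_i: chunk means, s_i2: chunk variances (NA when n_i == 1)
combine_chunk_var <- function(n_i, m_i, s_i2) {
  N <- sum(n_i)              # total sample size
  M <- sum(n_i * m_i) / N    # pooled mean
  within <- sum((n_i - 1) * s_i2, na.rm = TRUE) # within-chunk sum of squares
  between <- sum(n_i * (m_i - M)^2)             # between-chunk sum of squares
  (within + between) / (N - 1)
}

# Check against var() on the full sample
x <- runif(1000)
chunks <- split(x, ceiling(seq_along(x) / 7))
pooled <- combine_chunk_var(
  sapply(chunks, length),
  sapply(chunks, mean),
  sapply(chunks, var)
)
all.equal(var(x), pooled)
```

Since the chunking code already iterates over per-chunk results, only the per-chunk size, mean, and variance would need to be carried along to make this work.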

karpfen commented 4 weeks ago

Thanks, this is the place I was looking for. Before I start on this, should I wait for your performance branches to be merged? I don't think it would cause too much friction, actually.

goergen95 commented 4 weeks ago

No, please go ahead and open a PR when you are ready.