Open karpfen opened 3 months ago
Thank you for looking into this! I suppose the right place to hook up a function like this to aggregate chunks is here
Thanks, this is the place I was looking for. Before I start on this, should I wait for your performance branches to be merged? I don't think it would cause too much friction, actually.
No, please go ahead and open a PR if you are ready.
Currently, the summary statistics for indicator values in batch processing are combined by nesting them. For variance, this leads to unexpected results.
Normally, you'd estimate the variance as:
$Var(X) = \frac{1}{N - 1} \sum (x_i - \overline{x})^2$
which we can decompose into
$Var(X) = \frac{\sum(n_i - 1) s_i^2 + \sum n_i (m_i - M)^2} {N - 1}$
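where $n_i$, $m_i$ and $s_i^2$ are the size, mean and sample variance of chunk $i$, $N = \sum n_i$ is the total number of observations, and $M$ is the overall mean across all chunks.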
This works in general, but I have not yet looked at how practical it would be to implement in the package. I just wanted to open this issue as an FYI for now.
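
To make the decomposition concrete, here is a minimal Python sketch that combines per-chunk statistics and checks the result against the variance computed over all values at once. It assumes each chunk reports its size ($n_i$), mean ($m_i$) and sample variance ($s_i^2$), and that the overall mean $M$ is the size-weighted mean of the chunk means. The function name `combine_chunk_variances` and the sample data are hypothetical and not part of the package.

```python
from statistics import mean, variance


def combine_chunk_variances(sizes, means, variances):
    """Combine per-chunk (n_i, m_i, s_i^2) into one sample variance.

    Hypothetical helper for illustration only. A chunk with a single
    value should be passed with variance 0.0; its within-chunk term
    has weight (n_i - 1) = 0 anyway.
    """
    total_n = sum(sizes)
    if total_n < 2:
        raise ValueError("need at least two observations in total")

    # Overall mean M as the size-weighted mean of the chunk means.
    overall_mean = sum(n * m for n, m in zip(sizes, means)) / total_n

    # Within-chunk part: sum of (n_i - 1) * s_i^2
    within = sum((n - 1) * s2 for n, s2 in zip(sizes, variances))

    # Between-chunk part: sum of n_i * (m_i - M)^2
    between = sum(n * (m - overall_mean) ** 2 for n, m in zip(sizes, means))

    return (within + between) / (total_n - 1)


# Made-up example data: three chunks of different sizes.
chunks = [[1.0, 2.0, 3.0], [10.0, 12.0], [4.0, 4.5, 5.0, 7.5]]
sizes = [len(c) for c in chunks]
means = [mean(c) for c in chunks]
variances = [variance(c) for c in chunks]

combined = combine_chunk_variances(sizes, means, variances)
direct = variance([x for c in chunks for x in c])

print(combined, direct)
```

The two printed values should agree up to floating-point rounding, which is the property the aggregation step would need to preserve.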