Closed original-brownbear closed 4 weeks ago
This is what we're trying to do, right?
Jup exactly :)
I just figured if it's ok to use statistics, it's nicer to use that code-wise, that's all. We can do our own N-1 indeed :)
Makes sense to me. @mikemccand Can you confirm that your beast has Python >= 3.8?
I added this p-value feature a while back, and I guess my assumption was always that the denominator would be big enough for this to not matter, but perhaps it's not so?
Sorry got distracted by some emergencies and failed to respond here.
> but perhaps it's not so?
It depends :) I kind of suck at explaining this cleanly a decade+ out of university, but if I remember correctly, for a normal distribution the 5th percentile is about 1.65 standard deviations from the mean. For 20 JVMs, the population formula under-estimates the std-deviation by a factor of sqrt(20/19) ~= 1.026, i.e. about 2.6%. So I guess you could argue that you're getting the bounds of your p-95 interval wrong by something like 4-5% of a std-deviation on average, and that translates into a confidence of 95% vs. 80% in some cases if you look at the t-value tables.
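For illustration, the sqrt(N/(N-1)) factor can be checked directly with the standard-library statistics module. This is only a sketch: the `samples` list is made-up data standing in for 20 per-JVM results, not real benchmark output, and the variable names are hypothetical.

```python
import math
import statistics

# Hypothetical per-JVM results for 20 runs; not real benchmark data.
samples = [100.0 + 0.5 * i for i in range(20)]
n = len(samples)

pop_std = statistics.pstdev(samples)    # population formula, divides by N
sample_std = statistics.stdev(samples)  # sample formula, divides by N - 1

# The two estimates differ by exactly sqrt(N / (N - 1)), whatever the data.
ratio = sample_std / pop_std
expected = math.sqrt(n / (n - 1))
print(f"ratio={ratio:.4f} expected={expected:.4f}")

# Resulting shift of a one-sided 95% bound (z ~= 1.645),
# expressed as a fraction of the population std-dev.
z95 = 1.645
bound_shift = z95 * (ratio - 1.0)
print(f"bound shift ~= {bound_shift:.3f} std-devs")
```

The ratio depends only on N, so the smaller the number of JVMs, the larger the under-estimate.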
It's a little more impactful for us, I think, since we then use those numbers for a geometric mean of two std-deviations when doing the t-test without the assumption of equal std-dev, which costs some power vs. the Student's t-test.
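As a rough sketch of a t-test without the equal-variance assumption, here is the textbook Welch formulation using sample variances. This is an assumption about the general technique, not the benchmark script's actual code; `welch_t`, `baseline`, and `candidate` are hypothetical names and the numbers are made up.

```python
import math
import statistics

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    na, nb = len(a), len(b)
    # statistics.variance is the sample variance (N - 1 denominator).
    va = statistics.variance(a) / na
    vb = statistics.variance(b) / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Made-up measurements: candidate is baseline shifted up by 3.
baseline = [100.0, 101.5, 99.0, 100.5, 102.0]
candidate = [103.0, 104.5, 102.0, 103.5, 105.0]
t, df = welch_t(baseline, candidate)
print(f"t={t:.3f} df={df:.1f}")
```

Because both variances sit in the denominator of t, an under-estimate in either one inflates |t| and therefore shrinks the p-value.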
Long story short :D I suck at perfectly explaining this nowadays, but I'm pretty sure I'm still able to apply this stuff correctly. Just running main vs. main and seeing p-95 changes at some rate (or observing the same in the nightly benchmarks when there were no code changes on a weekend) is another indication that we're underestimating std-dev :)
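A main vs. main comparison can be mimicked with a small simulation: draw both sides from the same distribution and count how often the test flags a difference. This is a sketch under stated assumptions, not the real benchmark: the normal distribution, its parameters, the trial count, and the critical value of about 2.02 (roughly the two-sided 5% threshold for around 38 degrees of freedom) are all chosen for illustration, and `t_stat` and `false_positive_rate` are hypothetical helpers.

```python
import math
import random
import statistics

def t_stat(a, b, var_fn):
    """Unequal-variance t statistic using the given variance estimator."""
    va = var_fn(a) / len(a)
    vb = var_fn(b) / len(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)

def false_positive_rate(var_fn, trials=2000, n=20, critical=2.02, seed=42):
    """Fraction of same-distribution comparisons flagged as different."""
    rng = random.Random(seed)  # same seed => same data for both estimators
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(100.0, 5.0) for _ in range(n)]
        b = [rng.gauss(100.0, 5.0) for _ in range(n)]
        if abs(t_stat(a, b, var_fn)) > critical:
            hits += 1
    return hits / trials

rate_pop = false_positive_rate(statistics.pvariance)    # divides by N
rate_sample = false_positive_rate(statistics.variance)  # divides by N - 1
print(f"population variance: {rate_pop:.3f}, sample variance: {rate_sample:.3f}")
```

With identical data, the population-variance statistic is always the larger of the two in absolute value, so it can only flag more runs, never fewer. The gap is modest at n=20 and grows as n shrinks.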
> Makes sense to me. @mikemccand Can you confirm that your beast has Python >= 3.8?
beast3 is using Python 3.12.5, all good!
We shouldn't use the population variance when we're clearly sampling. This massively underestimates the variance in some cases, leading to a number of nightlies showing p-values of 0 in the absence of code changes. Implemented this using Python's statistics module; it should be OK to rely on Python 3.8 at this point, right?