mikemccand / luceneutil

Various utility scripts for running Lucene performance tests
Apache License 2.0
205 stars 115 forks

Use sample-stddev instead of population-stddev in t-test #308

Closed original-brownbear closed 4 weeks ago

original-brownbear commented 1 month ago

We shouldn't use the population variance when we're clearly sampling. This massively underestimates the variance in some cases, leading to a number of nightlies showing p-values of 0 in the absence of code changes. I implemented this using Python's statistics module; it should be OK to rely on Python 3.8 at this point, right?
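For reference, the standard-library statistics module has offered both estimators since Python 3.4 (the 3.8 question presumably relates to statistics.NormalDist, which landed in 3.8). A minimal sketch of the difference, with made-up benchmark timings:

```python
import math
import statistics

# Hypothetical per-JVM benchmark times (made-up numbers for illustration).
samples = [102.1, 99.8, 101.5, 98.9, 100.7]

pop = statistics.pstdev(samples)  # population stddev: divides by N
samp = statistics.stdev(samples)  # sample stddev: divides by N - 1 (Bessel's correction)

# The sample stddev exceeds the population stddev by exactly sqrt(N / (N - 1)).
n = len(samples)
assert math.isclose(samp, pop * math.sqrt(n / (n - 1)))
assert samp > pop
```

So for a fixed set of runs, pstdev always reports a smaller (over-optimistic) spread than stdev.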

original-brownbear commented 1 month ago

> This is what we're trying to do, right?

Yup, exactly :) I just figured that if it's OK to use statistics, it's nicer code-wise to use that, that's all. We can do our own N-1 indeed :)
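Doing "our own N-1" would just be Bessel's correction written out by hand; a hypothetical sample_stddev that matches statistics.stdev:

```python
import math
import statistics

def sample_stddev(xs):
    # Bessel-corrected standard deviation: divide the sum of squared
    # deviations by N - 1 instead of N before taking the square root.
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

data = [10.0, 12.0, 9.5, 11.2]
assert math.isclose(sample_stddev(data), statistics.stdev(data))
```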

jpountz commented 1 month ago

Makes sense to me. @mikemccand Can you confirm that your beast has Python >= 3.8?

msokolov commented 1 month ago

I added this p-value feature a while back, and I guess my assumption was always that the denominator would be big enough for this to not matter, but perhaps it's not so?

original-brownbear commented 1 month ago

Sorry got distracted by some emergencies and failed to respond here.

> but perhaps it's not so?

It depends :) I kind of suck at explaining this cleanly a decade+ out of university, but if I remember correctly, for a normal distribution the 5th percentile sits about 1.65 standard deviations from the mean. For 20 JVMs, the population formula under-estimates the standard deviation by a factor of sqrt(20/19) ~= 2.6%. So visually you could argue that you're getting the bounds of your p-95 interval wrong by about 5% of a standard deviation on average, and if you look at the t-value tables, that translates into a confidence of 95% vs. 80% in some cases. I think it's a little more impactful for us because we then combine those two standard deviations in a t-test that doesn't assume equal variances (Welch's), which costs some power versus Student's t-test. Long story short :D I'm no longer great at explaining this, but I'm pretty sure I can still apply it correctly. Just running main vs. main and seeing p-95 changes at some rate (or observing the same in the nightly benchmarks over a weekend with no code changes) is another indication that we're underestimating the std-dev :)
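A rough sketch of the numbers above, using Welch's t-statistic as a stand-in for the script's t-test (this is illustrative, not luceneutil's actual code):

```python
import math

# Bessel correction factor for n = 20 JVM runs: how much the population
# formula under-estimates the sample standard deviation.
n = 20
correction = math.sqrt(n / (n - 1))  # about 1.026, i.e. ~2.6%

def welch_t(xs, ys):
    """Welch's t-statistic: combines the two sample variances without
    assuming they are equal (costs some power vs. Student's t-test)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)  # sample variance, N - 1
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

# main vs. main with identical timings: the means agree, so t = 0 and
# no difference should be detected.
runs = [100.0, 101.5, 99.2, 100.8, 100.1]
assert welch_t(runs, runs) == 0.0
```

Underestimating both variances shrinks the denominator of the t-statistic, inflating |t| and pushing p-values toward 0, which matches the main-vs-main false positives seen in the nightlies.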

mikemccand commented 3 weeks ago

> Makes sense to me. @mikemccand Can you confirm that your beast has Python >= 3.8?

beast3 is using Python 3.12.5, all good!