glinscott / leela-chess

**MOVED TO https://github.com/LeelaChessZero/leela-chess** A chess adaptation of GCP's Leela Zero
http://lczero.org
GNU General Public License v3.0

Stopping condition for matches based on beta distributional assumption #263

Open Akababa opened 6 years ago

Akababa commented 6 years ago

https://gist.github.com/Akababa/356b6881cb63cfc05094c83ba3761ef0

I tried coming up with a new stopping condition for matches by calculating various probabilities, as follows: let p = P(new network wins one game). It seems reasonable that newly trained networks would have a p following a beta distribution (say Beta(2,2)) relative to the current best network. So if we know the new net won w of the first n games (in other words, we obtain a datum from Binom(n, p)), the posterior distribution f(p | (w, n)) is Beta(2+w, 2+n-w), since the beta distribution is the conjugate prior to the binomial. Using this, we just evaluate 1 - CDF(0.55) of the posterior to obtain P(p > 0.55).
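A minimal sketch of this posterior tail computation (the counts w and n are illustrative, not from a real match). For integer shape parameters the Beta survival function can be computed exactly from a binomial CDF via the standard order-statistics identity, so no scipy is needed:

```python
import math

def beta_sf(t, a, b):
    """P(X > t) for X ~ Beta(a, b) with integer a, b.
    Uses the identity P(Beta(a,b) > t) = P(Binom(a+b-1, t) <= a-1)."""
    m = a + b - 1
    return sum(math.comb(m, k) * t**k * (1 - t)**(m - k) for k in range(a))

# Hypothetical counts: new net won w of the first n games
w, n = 60, 100
# Beta(2,2) prior + Binom(n,p) likelihood -> Beta(2+w, 2+n-w) posterior
posterior_tail = beta_sf(0.55, 2 + w, 2 + (n - w))  # P(p > 0.55 | w, n)
print(round(posterior_tail, 3))
```

With 60 wins out of 100, the posterior mean is about 0.60, so the posterior puts most (but not overwhelming) mass above the 0.55 threshold.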

I tried this test on a few matches from lczero.org and was surprised at how powerful it was! When SPRT gives up (at alpha=0.05?), the beta test typically returns p in the range of 0.0001, almost 3 orders of magnitude lower in false-negative rate! I thought it was too good to be true, so I simulated the random process and came up with the same values every single time. But I still think something must be wrong, since Stockfish has been using SPRT forever. In fact I was able to make the test even stronger by using the posterior of p to compute P(final wins>220 | (w,n)). In simulations it takes an average of 30-40 games to either accept or reject, with both error rates at 0.01.
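A sketch of the kind of simulation described above, under the assumption that each candidate's true win rate really is drawn from the assumed Beta(2,2) prior (the 0.99 confidence level and 400-game cap are illustrative choices, not the gist's exact settings). When the prior is correct, the realized error rate of the Bayesian stopping rule matches its nominal level:

```python
import math
import random

def beta_sf(t, a, b):
    """P(X > t) for X ~ Beta(a, b) with integer a, b (exact, via the
    order-statistics identity with a Binomial(a+b-1, t) CDF)."""
    m = a + b - 1
    return sum(math.comb(m, k) * t**k * (1 - t)**(m - k) for k in range(a))

def run_match(p, conf=0.99, max_games=400):
    """Play games at true win rate p; stop once the Beta(2,2) posterior
    says P(p > 0.55) >= conf (accept) or <= 1 - conf (reject)."""
    w = 0
    for n in range(1, max_games + 1):
        w += random.random() < p
        tail = beta_sf(0.55, 2 + w, 2 + n - w)
        if tail >= conf:
            return "accept", n
        if tail <= 1 - conf:
            return "reject", n
    return "undecided", max_games

random.seed(1)
results = []
for _ in range(100):
    p = random.betavariate(2, 2)  # candidate strength drawn from the prior
    decision, games = run_match(p)
    results.append((p, decision, games))

errors = sum((d == "accept") != (p > 0.55)
             for p, d, _ in results if d != "undecided")
avg_games = sum(g for _, _, g in results) / len(results)
print(avg_games, errors)
```

Because Beta(2,2) puts most of its mass well away from 0.55, the typical match ends quickly; the rare candidates near the threshold dominate the average game count.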

Does anyone know what's going on here? I think I must have made an incorrect assumption, or messed up the simulations somehow.

vdbergh commented 6 years ago

If you assume a known prior then you can do much better than the SPRT. But if your prior is wrong (very likely, since on what grounds would you choose it?), your error probabilities will be incorrect and possibly very bad.

The SPRT guarantees are independent of any prior.

Akababa commented 6 years ago

I'm aware of that, but is there any reason not to assume a known prior when it comes from a well-defined process like this one? I found that the test was not very sensitive to the parameters of the prior (since 2 is small compared to the number of games), so in practice it would suffice to assume p comes from some beta distribution, which seems almost certainly true.

vdbergh commented 6 years ago

The point is not sensitivity to the parameters of the prior, but the fact that the prior itself may be wrong, in which case the Bayesian reasoning gives the wrong result.

If you do a simulation at a fixed Elo and you stop once the posterior indicates that P(p > 0.55) >= 0.95 or P(p > 0.55) <= 0.05, then you will see that the error rates in reality are bigger than 0.05.
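A sketch of that check, fixing the true win rate at p = 0.56 (just above the 0.55 threshold, rather than drawing it from the prior) and stopping on the 0.95/0.05 posterior tails; the trial count and game cap are illustrative. The repeated looks at the data inflate the realized false-reject rate well past the nominal 0.05:

```python
import math
import random

def beta_sf(t, a, b):
    """P(X > t) for X ~ Beta(a, b) with integer a, b."""
    m = a + b - 1
    return sum(math.comb(m, k) * t**k * (1 - t)**(m - k) for k in range(a))

def decide(p, conf=0.95, max_games=300):
    """Stop when the Beta(2,2) posterior tail P(p > 0.55) crosses
    conf (accept) or 1 - conf (reject); check after every game."""
    w = 0
    for n in range(1, max_games + 1):
        w += random.random() < p
        tail = beta_sf(0.55, 2 + w, 2 + n - w)
        if tail >= conf:
            return "accept"
        if tail <= 1 - conf:
            return "reject"
    return "undecided"

random.seed(2)
trials = 100
rejects = sum(decide(0.56) == "reject" for _ in range(trials))
print(rejects / trials)  # realized false-reject rate at true p = 0.56
```

Since p = 0.56 really is above 0.55, every "reject" is an error; checking the posterior after every single game is exactly the "sampling to a foregone conclusion" effect that breaks the frequentist guarantee when the prior is wrong.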

Akababa commented 6 years ago

That's true, I did get a huge error rate for p=0.56...

BUT if we did measure the p-distribution of newly-trained networks somehow, and determined that it is indeed a beta distribution, would it be safe to use this test? Or is it possible that they don't follow a beta distribution at all?