Rocky640 opened 6 years ago
Hi rocky, sounds like an interesting idea. Could you describe a case in which you would like to use this stopping rule?
Apart from regression tests, some "comparison" or "milestone" tests such as http://tests.stockfishchess.org/tests/view/5a6ac83d0ebc590d945d59b3 could finish faster.
Also, often before trying to improve a feature, we remove the feature and need an elo measure. Is this feature worth 5 elo, 10, 30... or just 1? Is it worth the trouble of spending 10 tests tweaking it?
One just needs a quick measure in the +-2 interval with 90% confidence.
We can also get an elo measure by running an SPRT and using a tool such as http://hardy.uhasselt.be/Toga/live_elo.html, but this is not exactly the same thing.
I see what you mean. Maybe a first iteration would be to add a link to the correct page of live_elo on the test page, such that it is even easier to use it.
This proposal makes a lot of sense. Fixed-length tests are typically used to make an elo measurement (if you want a binary answer, then the SPRT is more efficient). So instead of asking for the length of the test, it makes much more sense to ask for the desired accuracy.
I am sure this is theoretically ok. However it would be best to provide some canned tests (like in the SPRT case) with reasonable resource consumption. Otherwise people will ask for an accuracy of +-0.5 elo ... If there is interest I can hack a quick python script to make the conversion length <-> accuracy (it is a trivial computation so anyone can of course do it).
The conversion between length and accuracy actually depends on the draw rate, so it is not that trivial.
The new stopping rule would just be to stop the test once the accuracy (at 95%) reaches a given level, right?
This sounds very good!
To be clear, the way I see it is:
a) the user sets a confidence level (95%, 90%, ...)
b) the user requests some error bars (the "accuracy"): +-1.0, +-1.5, +-2.0, ...
c) the test starts
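The stop check implied by the steps above could be sketched as follows. This is a hypothetical illustration, not fishtest's actual code: the function name, the fixed z = 1.96 (95% confidence), and the trinomial win/draw/loss model are all assumptions made here for concreteness.

```python
import math

def should_stop(wins, draws, losses, target_elo=2.0, z=1.96):
    """Return True once the z*SE error bar on the elo estimate
    (converted from the score scale) is within target_elo.
    Hypothetical sketch; fishtest's real error-bar code may differ."""
    n = wins + draws + losses
    if n < 2:
        return False
    s = (wins + 0.5 * draws) / n              # observed score
    var = (wins + 0.25 * draws) / n - s * s   # per-game score variance
    if s <= 0.0 or s >= 1.0 or var <= 0.0:
        return False                          # elo estimate diverges
    se = math.sqrt(var / n)                   # standard error of the mean score
    # convert score error bar to elo via the derivative of the
    # logistic elo curve elo(s) = -400*log10(1/s - 1)
    deriv = 400.0 / (math.log(10.0) * s * (1.0 - s))
    return z * se * deriv <= target_elo
```

For example, with a balanced result at a ~65% draw ratio, roughly 200000 games bring the 95% error bar under +-1, so the check fires well before that for a +-2 target.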
Ok, are you sure you want to let the user choose the confidence level? I would pick one and be done with it.
@stefano80 Yes I would certainly stick with 95%. And yes the conversion depends on the draw ratio. I'll write something.
FWIW Here is a script that converts between num_games and accuracy, given the draw ratio. I did a few spot checks and it appears to agree with fishtest.
Fixed an error. The script was only correct for elo_diff=0. I did not notice it because the dependence of the accuracy on the elo difference is rather weak (the accuracy is mainly influenced by the draw ratio and the number of games).
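The script itself is not reproduced in the thread. A minimal sketch of such a length <-> accuracy conversion, under the same assumptions discussed here (trinomial game results with a given draw ratio, 95% confidence via z = 1.96, elo_diff shifting the expected score), might look like this. Function names and the closed-form variance are illustrative, not the posted script:

```python
import math

def accuracy(num_games, draw_ratio, elo_diff=0.0, z=1.96):
    """95% error bar (in elo) on the elo estimate after num_games games.
    Sketch only: models each game as win/draw/loss with the given
    draw ratio, expected score set by elo_diff."""
    s = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))  # expected score
    p_win = s - draw_ratio / 2.0
    var = p_win + 0.25 * draw_ratio - s * s         # per-game score variance
    se = math.sqrt(var / num_games)                 # std error of mean score
    # convert the score error bar to elo via the logistic derivative
    deriv = 400.0 / (math.log(10.0) * s * (1.0 - s))
    return z * se * deriv

def games_for_accuracy(target_elo, draw_ratio, elo_diff=0.0, z=1.96):
    """Smallest num_games whose error bar is <= target_elo.
    Uses the fact that the error bar scales as 1/sqrt(num_games)."""
    one_game = accuracy(1, draw_ratio, elo_diff, z)
    return math.ceil((one_game / target_elo) ** 2)
```

As a sanity check against the numbers quoted below: with a draw ratio around 0.65, a +-1 elo error bar requires on the order of 160000 games, and +-2 about a quarter of that, since halving the error bar quadruples the game count.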
Thx, now we have to incorporate this in fishtest and then we can try it out.
I do not think this script needs to be integrated into fishtest per se. The idea of stopping when a certain accuracy is reached is independent of it (the code to compute the error bars of a running test is already in fishtest).
The script can however be used to judge what are reasonable accuracies to aim for (+-0.5 is unreasonable, +-1 could be sensible in extremely convoluted cases (~160000 games), +-2 is fairly reasonable in normal situations).
@vdbergh bump :)
When creating a new test, we have the options SPRT, NumGames, or SPSA. These are just some ideas for discussion.
NumGames is mainly used for regression tests.
We can see a few results here (and elsewhere too) http://tests.stockfishchess.org/tests/user/gogamoga1
A typical result is expressed as follows: ELO: 57.25 +-1.9 (95%)
How about a different stopping criterion: stop as soon as the error bars are within some target (typically +-2.0) at some confidence level (95%)?
This could be faster and more useful than NumGames, where we never know how many games are enough (is 20000 enough, or 30000, or even 60000?).