official-stockfish / fishtest

The Stockfish testing framework
https://tests.stockfishchess.org/tests

New stop rule #137

Open Rocky640 opened 6 years ago

Rocky640 commented 6 years ago

When creating a new test, we have the options SPRT, NumGames, or SPSA. These are just some ideas for discussion.

NumGames tests are mainly used for regression testing.

We can see a few results here (and elsewhere too) http://tests.stockfishchess.org/tests/user/gogamoga1

A typical result is expressed as follows: ELO: 57.25 +-1.9 (95%)

How about a different stop criterion: stop as soon as the error bars are within some bound (typically +-2.0) at some confidence level (95%)?

This could be faster and more useful than NumGames, where we never know how many games are enough (is 20000 enough, or 30000, or even 60000?).
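The proposed rule can be sketched roughly as follows. This is only an illustration under a normal approximation, not fishtest's actual code; the function names, the `min_games` guard and the way the error bar is derived from the W/D/L counts are all assumptions made for the example.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal


def elo_from_score(score):
    """Logistic elo difference for an expected score strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_error(wins, draws, losses, z=Z_95):
    """Return (elo, half_width) for a W/D/L record.

    Assumes the score interval stays strictly inside (0, 1),
    which holds for any reasonably long test between similar engines.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0 points).
    var = (wins * 1.0 + draws * 0.25) / n - score ** 2
    stderr = math.sqrt(var / n)
    lo = elo_from_score(score - z * stderr)
    hi = elo_from_score(score + z * stderr)
    return elo_from_score(score), (hi - lo) / 2.0


def should_stop(wins, draws, losses, target=2.0, min_games=1000):
    """Hypothetical stop rule: stop once the error bar is within +-target elo."""
    if wins + draws + losses < min_games:
        return False  # too few games for the approximation to be meaningful
    _, half_width = elo_with_error(wins, draws, losses)
    return half_width <= target
```

A test would call `should_stop` after each batch of games is reported and finish as soon as it returns `True`.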

Stefano80 commented 6 years ago

Hi rocky, sounds like an interesting idea. Could you describe a case in which you would like to use this stopping rule?

Rocky640 commented 6 years ago

Apart from regression tests, some "comparison" or "milestone" tests such as http://tests.stockfishchess.org/tests/view/5a6ac83d0ebc590d945d59b3 could finish faster.

Also, often before trying to improve a feature, we remove the feature and need an elo measure. Is this feature worth 5 elo, 10, 30... or just 1? Is it worth the trouble to spend 10 tests tweaking it?

One just needs a quick measure within a +-2 interval at 90% confidence.

We can also get an elo measure by running an SPRT and using a tool such as http://hardy.uhasselt.be/Toga/live_elo.html, but this is not exactly the same thing.

Stefano80 commented 6 years ago

I see what you mean. Maybe a first iteration would be to add a link to the correct page of live_elo on the test page, such that it is even easier to use it.

vdbergh commented 6 years ago

This proposal makes a lot of sense. Fixed length tests are typically used to make an elo measurement (if you want a binary answer then the SPRT is more efficient). So instead of asking for the length of the test it makes much more sense to ask for the desired accuracy.

I am sure this is theoretically ok. However it would be best to provide some canned tests (like in the SPRT case) with reasonable resource consumption. Otherwise people will ask for an accuracy of +-0.5 elo ... If there is interest I can hack a quick Python script to make the conversion length <-> accuracy (it is a trivial computation so anyone can of course do it).

Stefano80 commented 6 years ago

The conversion between length and accuracy actually depends on the draw rate, so it is not that trivial.

The new stop rule would just be to stop the test once accuracy (95%) reached a given level, right?

This sounds very good!

Rocky640 commented 6 years ago

To be clear, the way I see it is:
a) user would set a confidence level (95%, 90%, ...)
b) user would request some error bars (the "accuracy") (+- 1.0, +- 1.5, +/- 2.0, ...)
c) start the test
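To get a feel for how the two user inputs interact, here is a small sketch. It assumes a normal approximation, under which the required number of games scales as (z / error_bar)^2, where z is the two-sided quantile for the chosen confidence level. The function names and the reference setting (95%, +-2.0) are made up for the example.

```python
from statistics import NormalDist


def z_for_confidence(confidence):
    """Two-sided normal quantile, e.g. 0.95 -> about 1.96."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


def relative_cost(confidence, error_bar, ref_confidence=0.95, ref_error_bar=2.0):
    """Approximate game count relative to a reference setting.

    Under the normal approximation, games needed scale as (z / error_bar)**2.
    """
    z = z_for_confidence(confidence)
    z_ref = z_for_confidence(ref_confidence)
    return (z / error_bar) ** 2 / (z_ref / ref_error_bar) ** 2


if __name__ == "__main__":
    for conf, bar in [(0.95, 2.0), (0.90, 2.0), (0.95, 1.5), (0.95, 1.0)]:
        print(f"{conf:.0%} +-{bar}: x{relative_cost(conf, bar):.2f} games")
```

Halving the error bars costs about four times as many games, while dropping from 95% to 90% confidence saves roughly 30%.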

Stefano80 commented 6 years ago

Ok, are you sure you want the user to specify the confidence level? I would choose one and be done with it.

vdbergh commented 6 years ago

@stefano80 Yes I would certainly stick with 95%. And yes the conversion depends on the draw ratio. I'll write something.

vdbergh commented 6 years ago

FWIW Here is a script that converts between num_games and accuracy, given the draw ratio. I did a few spot checks and it appears to agree with fishtest.

http://hardy.uhasselt.be/Toga/resolution.py

vdbergh commented 6 years ago

Fixed an error. The script was only correct for elo_diff=0. I did not notice it because the dependence of the accuracy on the elo difference is rather weak (the accuracy is mainly influenced by the draw ratio and the number of games).
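For readers who want to see the shape of such a conversion, here is a minimal sketch. It is not the linked resolution.py; it assumes the logistic elo model, a normal approximation with the delta method, and a hypothetical function name `accuracy`.

```python
import math


def accuracy(num_games, draw_ratio, elo_diff=0.0, z=1.959964):
    """Approximate 95% error bar (in elo) after num_games games.

    Assumes the logistic elo model and a normal approximation.
    """
    # Expected score for the given elo difference.
    s = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))
    win = s - draw_ratio / 2.0
    loss = 1.0 - win - draw_ratio
    if win < 0.0 or loss < 0.0:
        raise ValueError("draw_ratio is too high for this elo_diff")
    # Per-game variance of the result (1, 0.5 or 0 points).
    var = win + draw_ratio / 4.0 - s * s
    # Derivative of elo with respect to score (delta method).
    slope = 400.0 / (math.log(10.0) * s * (1.0 - s))
    return z * slope * math.sqrt(var / num_games)


if __name__ == "__main__":
    for elo in (0.0, 10.0, 20.0):
        print(elo, round(accuracy(40000, 0.65, elo), 3))
```

In this sketch, moving `elo_diff` from 0 to 20 changes the result by well under one percent, whereas the draw ratio and the number of games enter directly through the variance term.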

Stefano80 commented 6 years ago

Thx, now we have to incorporate this in fishtest and then we can try it out.

vdbergh commented 6 years ago

I do not think this script needs to be integrated in fishtest per se. The idea of stopping when a certain accuracy is reached is independent of it (the code to compute the error bars of a running test is already in fishtest).

The script can however be used to judge what are reasonable accuracies to aim for (+-0.5 is unreasonable, +-1 could be sensible in extremely convoluted cases (~160000 games), +-2 is fairly reasonable in normal situations).
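The orders of magnitude can be reproduced with a rough inverse calculation. The sketch below is an illustration only: it assumes equal-strength engines (elo difference 0), a draw ratio of 0.65 chosen for the example, and a normal approximation; `games_needed` is a hypothetical helper.

```python
import math


def games_needed(error_bar, draw_ratio, z=1.959964):
    """Approximate games required for a +-error_bar elo interval at 95%.

    Assumes an elo difference of 0, so the score is 0.5 and the
    per-game variance is (1 - draw_ratio) / 4.
    """
    var = (1.0 - draw_ratio) / 4.0
    # Derivative of elo with respect to score at score = 0.5.
    slope = 400.0 / (math.log(10.0) * 0.25)
    return math.ceil((z * slope / error_bar) ** 2 * var)


if __name__ == "__main__":
    for bar in (0.5, 1.0, 2.0):
        print(f"+-{bar}: about {games_needed(bar, 0.65)} games")
```

With the assumed draw ratio, +-1 lands in the neighbourhood of the ~160000 games quoted above, +-2 needs about a quarter of that, and +-0.5 about four times as many.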

ppigazzini commented 2 years ago

@vdbergh bump :)