official-stockfish / Stockfish

A free and strong UCI chess engine
https://stockfishchess.org/
GNU General Public License v3.0

Discussion for LTC status quo #3197

Closed NKONSTANTAKIS closed 3 years ago

NKONSTANTAKIS commented 4 years ago

IMO NNUE is very fresh and offers tons of room for improvement. But LTC seems to have become almost impossible to pass.

This can be attributed to the narrow window for proving superiority: the draw rate is so high that improvements have to be immense to pass. Or, to put it differently: how do we reward playing better when worse play is adequate in most cases?

I have some ideas, which I present for consideration and assessment by the engineers.

  1. Use time odds, and the same opening with the odds reversed. So 4 games per opening.

This would greatly widen the window by lowering the draw rate, and would reward exploitative improvements as well as defensive capabilities. It would also act as a form of contempt by promoting changes that beat weaker opposition, because chess is so drawish that it is easier to hold than to win.

The rest of my ideas I will reveal gradually, as I suspect that a lot at once is unappealing.

vdbergh commented 4 years ago

Personally I think there is not enough evidence to say that the high LTC draw ratio is harmful. The high draw ratio saves us an enormous amount of resources for a fixed Elo resolution (this can be seen with the SPRT calculator). The drawback is that there might also be Elo compression, so that a higher resolution would be needed for patches to pass (requiring more resources). This Elo compression (if any) should be measured before any conclusions can be drawn.

syzygy1 commented 4 years ago

Any change that converts pairs of draws into pairs of win and loss would only make it more difficult to decide which of two engines is better. For example, 20 wins, 2 losses, 978 draws is a much stronger indication of superiority than 220 wins, 202 losses, 578 draws.
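The difference in confidence between those two hypothetical results can be illustrated with a rough z-statistic computed from the raw counts (a back-of-the-envelope trinomial sketch; fishtest's actual SPRT works on pentanomial game-pair statistics):

```python
import math

def trinomial_z(wins, draws, losses):
    """z-statistic for the hypothesis 'score > 0.5', using the
    per-game score variance estimated from raw W/D/L counts."""
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n             # empirical score
    var = (wins + 0.25 * draws) / n - s * s  # per-game score variance
    return (s - 0.5) * math.sqrt(n / var)

print(trinomial_z(20, 978, 2))     # ~3.9: strong evidence of superiority
print(trinomial_z(220, 578, 202))  # ~0.9: same +18 margin, weak evidence
```

Both samples have the same score (50.9%) and the same w - l = 18 margin, but the draw-heavy one has a much smaller per-game variance, so the same margin is far more significant.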

If draw ratio is indeed higher than before, it could make sense to lower the Elo bounds for a patch to pass to counter the Elo compression. (But I have no idea whether the SPRT mathematics might already take this into account automatically.)

vondele commented 4 years ago

We lowered the SPRT bounds a while ago after the NNUE merge. Even though the LTC bounds are now {0.25, 1.25}, it is not so easy to pass at a ~91% draw rate. The draw rate has already increased quite a bit, from ~87% just after the NNUE merge to where it is now (we've made the engine clearly stronger in this time). If we want to keep the expected maximum number of games manageable (say 130k), we could probably change to {0.25, 1.00} once the draw rate hits 93% or so. Gaining ~1 Elo today makes the engine significantly stronger, which is clearly visible in the W/L ratio.

vdbergh commented 4 years ago

A side comment: there is a mathematical solution to keeping resource usage constant, i.e. independent of the draw ratio, or the book. Bounds should be expressed in normalized Elo (i.e. in multiples of the standard deviation of a game pair, which can be dynamically measured). The disadvantage however would be the same as with the BayesElo bounds we abandoned. Such non standard ways of expressing bounds are not intuitive. Ultimately people like to think about standard (logistic) Elo, with all its flaws.

vondele commented 4 years ago

I've always found logistic Elo quite 'counter-intuitive'. I like to think about W/L ratios... for me W/L > 1.0X is a strong patch (definitely with X>=5, but probably already with X=1 or 2)... even better if using the pentanomial results (1.5 - 0.5 / 0.5 - 1.5)-ratio. It was the latter thing I referred to in the latest release message, explicitly not using an Elo result.

vdbergh commented 4 years ago

@vondele Personally I think of the W/L ratio (i.e. discarding draws) as being just as ad hoc as all other Elo measurements except normalized Elo. IMHO there is only one "objective" measure for the strength difference between two engines, and that is the amount of effort (games) it takes to prove, with a given level of confidence, that one engine is stronger than the other. This leads to the definition of "normalized Elo".

Normalized Elo is objective for fixed testing condition but as all Elo measurements it still depends on the testing conditions (e.g. the book and the TC). This leads to the notion of (relative) sensitivity of the testing conditions.

vondele commented 4 years ago

yes, presumably normalized Elo is the right quantity to think about... just, can you point to a link that defines it? I need to refresh my memory.

I think the 'playing with SPRT bounds' is actually going in that direction, since we have a natural number of games that seems reasonable (say 100k right now) to prove a patch is better than master.

vondele commented 4 years ago

OK, found the link to normalized Elo again; posting it here for future reference: http://hardy.uhasselt.be/Toga/normalized_elo.pdf

vondele commented 4 years ago

Actually, there is a link between normalized Elo and the W/L ratio, which at least allows one to compare patches on fishtest.

If we look at the ratio of the normalized Elos of two patches tested against master (call them a and b), under the assumption that w ~ l << d, and that both patches have a similar d (and thus similar w, l), we get t_0^a / t_0^b ~ (W^a / L^a - 1) / (W^b / L^b - 1). So if patch a has 4% more wins than losses and patch b has 2%, the normalized Elo of patch a will be roughly twice that of patch b. (This is basically the ratio of Eq. 2.8 in the above reference, with suitable simplifications based on these assumptions and ignoring higher-order terms.)

Obviously, that doesn't change the fact that the normalized Elo is the more solid concept, but rationalizes looking at W/L ratios.
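The approximation can be checked numerically with hypothetical counts at a ~91% draw rate (the trinomial normalized-Elo formula below is a simplified sketch of the quantity in the linked note; fishtest's real statistics are pentanomial, and the patch counts are invented for illustration):

```python
import math

def norm_elo(wins, draws, losses):
    # expected score advantage over 0.5, in units of the per-game
    # standard deviation (trinomial sketch of normalized Elo)
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - s * s
    return (s - 0.5) / math.sqrt(var)

N = 100_000
# hypothetical patch a: W/L = 1.04, patch b: W/L = 1.02, equal losses
t_a = norm_elo(4680, N - 4680 - 4500, 4500)
t_b = norm_elo(4590, N - 4590 - 4500, 4500)

exact = t_a / t_b                               # ratio of normalized Elos
approx = (4680 / 4500 - 1) / (4590 / 4500 - 1)  # (W/L - 1) ratio
print(exact, approx)  # exact ratio is ~1.99, the approximation gives 2.0
```

So under the stated assumptions (w ~ l << d, similar draw counts), the ratio of (W/L - 1) values is a good proxy for the ratio of normalized Elos.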

xoto10 commented 4 years ago

Any change that converts pairs of draws into pairs of win and loss would only make it more difficult to decide which of two engines is better. For example, 20 wins, 2 losses, 978 draws is a much stronger indication of superiority than 220 wins, 202 losses, 578 draws.

Sure. Under fishtest conditions results will be much less extreme than this, but I agree, draws must be important. e.g. 220 wins, 202 losses, 578 draws must be different from 220 wins, 202 losses, 9578 draws, although I must admit I have no idea which shows the clearest advantage without looking at a calculator.

I think my concern is that a lower number of decisive games allows more randomness into the result, or requires a larger number of games to avoid that randomness. At least it seems like that to me; I could be wrong. E.g. with 91% draws we get ~90 decisive games per 1000. If we require 4% extra wins for a pass, that is around +46 -44, which clearly needs many thousands of games to reach confidence. If the draw rate is only 82%, we have twice as many decisive games per thousand, which may allow a decisive result in a much smaller number of games. This assumes only a small number of double wins for one color, of course, which could be the practical flaw in this line of thinking.

xoto10 commented 4 years ago

Hmmm, I guess the problem with my argument is that I am assuming that if we increase the rate of decisive games from 10% to 20%, then the value of w-l will double. e.g. +46-44=910 will change to +92-88=820. This could be completely wrong, maybe the value of w-l would stay the same, giving +91-89=820, giving no advantage from an increased w-l. Oh well ... :-)

Edit: And indeed I started by believing that w-l corresponds to the Elo gain, which a calculator suggests is true, so more decisive games might not (and probably doesn't) lead to a larger value of w-l; we likely just get the same difference with larger w and l. ... And vdbergh tells us that a high draw rate is actually good for getting a faster result from SPRT. So, even though it seems counter-intuitive, I'm starting to believe that a high draw rate is fine :-) Given the great minds here that have been saying this for a while, perhaps I should have been convinced sooner :-)

syzygy1 commented 4 years ago

Going from 46-44-910 to 91-89-820 would in fact be a distinct disadvantage.

More important than the estimated Elo difference is the error margin. If skewing games helps to "increase" the Elo gain of a patch (of course changing the test conditions cannot actually increase the real Elo gain of the patch), this does not help at all if you are then less certain that there is a positive gain at all.

I guess skewing games might help if the skew is so light that it helps the stronger engine win more games without making it lose more games. I don't know how realistic that is when you're trying to measure a 0.5 Elo improvement. If 1% faster means a 1 Elo improvement, a skew of 1% more time for white would already seem to bring things completely out of balance.

NKONSTANTAKIS commented 4 years ago

Let's imagine having the resources for testing patches at TCEC TC and seeing a 96% draw rate. What would we do then in order to keep improving?

I think performance at "normal" chess is becoming too easy a task. Another perspective is that the engine should solve every chess position, regardless of whether that translates into Elo in normal positions.

A patch could gain Elo on a set of very hard positions while being totally neutral on normal ones.

So a plan could be to build a highest-resolution test set, without caring (much) about positional balance/uniformity, and experiment to figure out its correlation with normal book performance. The goal would be to explore the prospects of exploiting extreme resolution.

NKONSTANTAKIS commented 4 years ago

For easily implemented adjustments, I think vondele's proposed {0.25, 1.00} is needed for catching smaller Elo gains.

With NNUE and this draw rate the noise is lower than ever, so the Elo metrics are low-resolution but accurate. The style is positional, and random blunders happen much less than with HCE and contempt.

This IMO enables safely lowering the 0.25 bound to 0.2 or 0.15 to alleviate the increase in game count.

Alayan-stk-2 commented 3 years ago

Any change that converts pairs of draws into pairs of win and loss would only make it more difficult to decide which of two engines is better. For example, 20 wins, 2 losses, 978 draws is a much stronger indication of superiority than 220 wins, 202 losses, 578 draws.

Fishtest is using pentanomial statistics. What matters is the pentanomial result, so a 1-1 pair made of a win and a loss doesn't make the SPRT last longer than a 1-1 pair made of two draws. Now, if you increase the number of 1.5-0.5 or 2-0 pairs equally for both sides, you do lose confidence.
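Both halves of that point can be illustrated with pentanomial pair counts (hypothetical numbers; a pair's score out of 2 falls into one of the five categories 0, 0.5, 1, 1.5, 2):

```python
def pentanomial_stats(counts):
    """counts = [n0, n05, n1, n15, n2]: number of game pairs whose
    pair score (out of 2) is 0, 0.5, 1, 1.5 or 2.
    Returns (mean pair score, pair-score variance)."""
    scores = [0.0, 0.5, 1.0, 1.5, 2.0]
    n = sum(counts)
    mean = sum(c * s for c, s in zip(counts, scores)) / n
    var = sum(c * (s - mean) ** 2 for c, s in zip(counts, scores)) / n
    return mean, var

# A W+L pair and a D+D pair both land in the middle category
# (pair score 1), so they are statistically indistinguishable here:
print(pentanomial_stats([0, 0, 500, 0, 0]))    # (1.0, 0.0)

# Moving pairs symmetrically into the 1.5-0.5 and 0.5-1.5 categories
# keeps the mean but inflates the variance, i.e. less confidence per pair:
print(pentanomial_stats([0, 50, 400, 50, 0]))  # (1.0, 0.05)
```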

syzygy1 commented 3 years ago

@Alayan-stk-2 Indeed, I suspect it would be difficult to avoid increasing the number of 1.5-0.5 outcomes for both sides. But maybe I am wrong about this. You are right that looking at pentanomial results at least "protects" against 1-0/0-1 games due to too much skew by cancelling them out.

vdbergh commented 3 years ago

@syzygy Since repeating games with opposite colours cancels out the noise coming from the bias in the positions in the book, the quantity that becomes relevant is the “RMS bias”. This is the square root of the average of the squares of the biases. The RMS bias is reported on the raw statistics page. It is about 30 Elo for the current book (remarkably it is very constant).

The RMS bias + the draw ratio (which gives the trinomial variance) allows one to compute the variance of a test result through the “accounting identity”. http://hardy.uhasselt.be/Fishtest/accounting_identity.pdf. This is used in the SPRT calculator.

Vizvezdenec commented 3 years ago

https://tests.stockfishchess.org/tests/view/5faaf6c667cbf42301d6a7dc is a somewhat relevant patch. It will most likely finish in the 0 Elo zone; the draw rate is almost 94%.

syzygy1 commented 3 years ago

@Vizvezdenec It failed after only 20,000 games, which seems pretty good.

vondele commented 3 years ago

We have switched to normalized Elo on fishtest, which takes the draw rate into account. I think that addresses a central part of this issue.