[RFC] closed positions book.

vondele commented 4 years ago

I have made a pull request to the official book repo with a closed positions book: https://github.com/official-stockfish/books/pull/8. This still needs some testing, but should eventually be available.

I first want to do some testing comparing this to the noob_3moves book on fishtest before we possibly start using this, so that we have a feeling for its quality. My initial impression is rather good.

There are several options we can first discuss here before I decide on this.

- Allow patches to be tested against this book at normal STC and LTC, leaving the choice up to the submitter.
- Test all patches against this book, i.e. just switch for a couple of weeks.
- First retest a couple of patches that were aiming at closed positions but didn't pass.
- etc.

noobpwnftw commented 4 years ago

It makes sense now: Elo spread is related to the percentage of positions in the book that may be reached by playing SF topN moves. This is why closedpos had a good spread but popularpos didn't.

dorzechowski commented 4 years ago

I'm not sure. For example, the book 2moves_v1 contained basically random sequences of moves and had the same spread as noob_3moves. We measured it at the end of December and the results are below. It looks like books constructed differently, even with vastly different RMS bias, may give the same sensitivity.

| book | Elo spread | draw ratio | RMS bias |
| --- | --- | --- | --- |
| 2moves_v1 | 44.50 | 0.513 | 73.85 |
| noob_3moves | 44.90 | 0.566 | 31.47 |
| noob_2moves | 40.75 | 0.562 | 33.02 |

noobpwnftw commented 4 years ago

Well as for 2moves there are just 2 moves, so pretty much anything not losing a pawn's worth is within topN, and it did remove some outright bad moves.

dorzechowski commented 4 years ago

I added noob_2moves to the table above. Both 2moves books have very little in common it seems.

Actually, I now want to test the hypothesis that positions with a bigger node count at depth 13 are more complex. I'm going to sort the 12k positions from 2moves_v2 by depth 13 nodes, split them into 3 equal parts, and then use the 1st and 3rd parts as new books to play 8000-game matches between SF11 and SF10. If it's true that a bigger node count means more complexity, then the book made from the 3rd part should give a significantly bigger spread than the first one. It would be interesting to either confirm or debunk it. Unfortunately I have only a measly laptop, so it may take some time before I get back with the results.
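The sort-and-split step described above could look roughly like the sketch below. It assumes each position is available as a pair of an identifier and its depth 13 node count; the function name and the toy data are hypothetical and not taken from the actual experiment.

```python
def split_into_thirds(positions):
    """Sort (position, nodes) pairs by node count and cut them into three parts.

    If the number of positions is not divisible by 3, the last part
    receives the remainder.
    """
    ordered = sorted(positions, key=lambda item: item[1])
    third = len(ordered) // 3
    return ordered[:third], ordered[third:2 * third], ordered[2 * third:]


# Hypothetical toy data: (position label, nodes searched at depth 13).
toy_positions = [
    ("pos_a", 900), ("pos_b", 150), ("pos_c", 400),
    ("pos_d", 2500), ("pos_e", 700), ("pos_f", 1200),
]

low, middle, high = split_into_thirds(toy_positions)
print("low-node book: ", [name for name, _ in low])
print("high-node book:", [name for name, _ in high])
```

The first and last parts would then be written out as two separate books for the SF11 vs SF10 matches.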

noobpwnftw commented 4 years ago

The only difference between my 2moves and 3moves books is making one more move that is not too bad, and my scores are back-propagated. Still, I think the coverage ratio among topN matters; the spread of 2moves_v1 might be because a higher RMS matters only for the first few moves, but not beyond.

vondele commented 4 years ago

I have #W, #L, #D (White POV) for the noob_3moves positions from fishtest LTCs. It typically looks like:

  "rn1qkbnr/ppp2ppp/3p4/4pb2/2PP1P2/8/PP2P1PP/RNBQKBNR w KQkq -": [
    59,
    48,
    215
  ],
  "rnbqkb1r/pp1pppp1/2p2n1p/8/3P1P2/8/PPPBP1PP/RN1QKBNR w KQkq -": [
    38,
    27,
    186
  ],
  "rnbqkbnr/2pp1ppp/1p6/p3p3/8/3P4/PPPNPPPP/1RBQKBNR w Kkq -": [
    25,
    44,
    233
  ],
  "rn1qkb1r/pbpppppp/5n2/1p6/8/PP4P1/2PPPP1P/RNBQKBNR w KQkq -": [
    39,
    35,
    226
  ],

So, openings appear winnable from both sides. I don't directly see a pattern. @vdbergh do you think that this data could be used to select good positions for a book?

bookstats_noob_3moves.json.zip

NKONSTANTAKIS commented 4 years ago

A lot of 150K-350K eval yellows recently. Maybe check them on closedpos? I am thinking it's getting harder and harder to get 1 Elo with a single patch. As most of those should be around +0.5 to +1.3, I like the idea of a standardized decider. Different environment + excellent spread scaling of the book... how about at a bit higher LTC? It feels wasteful to throw them away after having spent so many LTC games. The higher the game count, the closer they are to +1. Well, probably around 0.9, due to selection bias.

Also with too many tests + low success rate, eventually some will pass out of luck. With a closer examination of the best performers the harvesting will be safer.

Atm it seems to me that too many resources are used on an extreme amount of different versions on very low pass rate, and thus a higher confidence would be logical.

noobpwnftw commented 4 years ago

closedpos will not make them pass. The LTC bounds are very narrow, so it is expected to take a large number of games to resolve patches that fall within this Elo diff range. This is the price to pay so that fewer patches pass by luck. A low success rate and too many similar tests cannot be solved by lowering the bar, while I'm colorblind so I cannot tell the difference between a yellow and a red SPRT test.

dorzechowski commented 4 years ago

@vondele I think we could calculate the SNR of each book position with the normalized Elo formula, or just check z=(w-l)/sqrt(w+l) and get rid of positions with z close to zero, as they don't give any signal. But it would also be good to get confirmation from @vdbergh, of course.
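As a minimal sketch, the z statistic can be computed directly from the per-position [W, L, D] counts in the format posted earlier in this thread. The helper name `z_score` is hypothetical, and returning 0 for a position with no decisive games is an assumption of this sketch, not something stated above.

```python
import math


def z_score(wins, losses):
    """z = (w - l) / sqrt(w + l); draws do not enter the statistic."""
    decisive = wins + losses
    if decisive == 0:
        return 0.0  # assumption: no decisive games means no signal
    return (wins - losses) / math.sqrt(decisive)


# Two entries copied from the noob_3moves statistics above: [W, L, D], White POV.
book_stats = {
    "rn1qkbnr/ppp2ppp/3p4/4pb2/2PP1P2/8/PP2P1PP/RNBQKBNR w KQkq -": [59, 48, 215],
    "rnbqkbnr/2pp1ppp/1p6/p3p3/8/3P4/PPPNPPPP/1RBQKBNR w Kkq -": [25, 44, 233],
}

z_by_position = {fen: z_score(w, l) for fen, (w, l, d) in book_stats.items()}
for fen, z in z_by_position.items():
    print(f"{z:+.3f}  {fen}")
```

A positive z means White scored more wins than losses from that position, a negative z the opposite.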

NKONSTANTAKIS commented 4 years ago

@noobpwnftw I want less patches to pass by luck, not more. Atm the pass rates are extremely low, but the amount of tested patches is huge, so inevitably the quality decreases & resources are wasted. For colorblind purposes the yellow can be regarded as red without lowering the elo bar but with an even higher amount of games. A higher spread will enable better performance.

closedpos had equal spread at STC but +2.7 at LTC, a very good indication.

So it might not make them pass as you say, but it can make them fail faster!

noobpwnftw commented 4 years ago

I hope so but with the large number of games their elo measurement is actually very accurate, they do fall around +0.5 range and they would still cost similar resources to conclude, and book probably won't change that. In fact, if it does, then I see trouble.

NKONSTANTAKIS commented 4 years ago

Well, at this point maybe even a +0.5 at worst is nice. Using millions of LTC games for little gain feels ineffective. What if without you? I also think that testing many versions of the same patches with slight changes is bad practice. One might get lucky in the end, worth 0.5, but at a very high price. The beast needs to be fed I guess... so why not get our +0.5 in a smarter way?

Btw I like the system more than ever, but I think it's very beneficial to keep evolving it, not only SF.

noobpwnftw commented 4 years ago

For that then I think it is important to understand how to manipulate elo spread.

This is my scored list of all unique positions after 2 moves without any filtering: https://www.chessdb.cn/downloads/2moves_scores.zip

I think I have calculated scores for any position up to 4 moves but the data is quite large.

vondele commented 4 years ago

@noobpwnftw could you make that scores data available for 3moves? Either all of it if it is less than a few GB, or just for the positions in the noob_3moves book? That will be interesting to correlate with z=(w-l)/sqrt(w+l).

vondele commented 4 years ago

[Image: snr]

Apart from a 'feature' near zero (not sure where this is coming from), the distribution of (w - l) / sqrt(w + l) is very Gaussian for the noob_3moves book. This could be because of the limited statistics for each of the openings. It might nevertheless be interesting to try to split the positions into two sets.

vondele commented 4 years ago

So, I locally did a test, splitting noob_3moves according to abs((w-l) / sqrt(w+l)) > 0.167 (roughly 1 sigma), and there is no measurable difference (60k games) between the low and high parts of the book. So I am starting to suspect the broad Gaussian is just the noise, and the feature near 0 is the signal... This is using the results of 44M LTC fishtest games using the noob_3moves book.
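A split of this kind could be sketched as follows. The function name `split_book` and the toy counts are hypothetical; only the criterion abs((w-l)/sqrt(w+l)) > 0.167 comes from the comment above, and this is not the script that was actually used.

```python
import math


def split_book(stats, threshold=0.167):
    """Partition positions into (low, high) by |z|, with z = (w - l) / sqrt(w + l).

    `stats` maps a position string to a (wins, losses, draws) triple.
    Positions without decisive games are treated as z = 0 (an assumption).
    """
    low, high = [], []
    for position, (wins, losses, _draws) in stats.items():
        decisive = wins + losses
        z = (wins - losses) / math.sqrt(decisive) if decisive else 0.0
        (high if abs(z) > threshold else low).append(position)
    return low, high


# Hypothetical toy counts, for illustration only.
toy_stats = {
    "balanced position": (50, 50, 200),
    "white-leaning position": (59, 48, 215),
    "black-leaning position": (25, 44, 233),
}

low_part, high_part = split_book(toy_stats)
print("low |z|: ", low_part)
print("high |z|:", high_part)
```

Each part can then be saved as its own book and used for a separate match.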

noobpwnftw commented 4 years ago

@vondele Full scores of positions after 3 moves: https://www.chessdb.cn/downloads/3moves_scores.zip

vondele commented 4 years ago

Interesting distribution of the scores of all positions after 3 moves...

[Image: 3moves_scores]

noobpwnftw commented 4 years ago

The features around -15 and 0 are probably caused by the way I calculate things. The distribution might actually be smooth, but it doesn't matter when you sample moves from a wider range.

dorzechowski commented 4 years ago

No difference in my tests between books created from positions with low or high node count at depth 13 (TC 10+0.1).

Low: Score of Stockfish_11 vs Stockfish_10: 2296 - 1236 - 4468 [0.566] 8000

High: Score of Stockfish_11 vs Stockfish_10: 2276 - 1276 - 4448 [0.562] 8000
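To compare the two results on the Elo scale, the win - loss - draw counts can be converted with the standard logistic Elo formula. This is a rough sketch with a hypothetical helper name; it ignores error bars, which matter when judging whether two 8000-game matches really differ.

```python
import math


def score_and_elo(wins, losses, draws):
    """Return (score fraction, logistic Elo difference) for a match result.

    The Elo value is undefined for a score of exactly 0 or 1.
    """
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    elo = -400.0 * math.log10(1.0 / score - 1.0)
    return score, elo


# Match results quoted above, in the order wins - losses - draws.
results = {
    "Low": (2296, 1236, 4468),
    "High": (2276, 1276, 4448),
}

for name, (w, l, d) in results.items():
    score, elo = score_and_elo(w, l, d)
    print(f"{name}: score {score:.4f}, Elo {elo:+.1f}")
```

Both books land within a few Elo of each other, in line with the conclusion that node count at depth 13 does not separate them.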

vondele commented 4 years ago

So, with https://github.com/official-stockfish/Stockfish/pull/2670 we have a first patch that resulted from the closedpos book. Let's call this a success :-)

I don't think we have particular evidence to change the default book, but I'm sure we now know that we still don't know quite a few things about opening books.

I'll thus close this issue, keeping noob_3moves the default book. The other books can be used as non-default books, either for experimenting or to create Elo gainers, but we'll test patches for non-regression against noob_3moves to gather experience with this setup, asserting that we prefer generic solutions rather than specialized ones.

xoto10 commented 4 years ago

See also: https://tests.stockfishchess.org/tests/view/5eb1e2dd2326444a3b6d33f9 #2662 :) Although the stc was with noob_3moves, don't remember why I made those choices. Probably intended to use closedpos with the stc but forgot to set it, then made sure I did for the ltc.

vondele commented 4 years ago

OK, I overlooked that... should have been in the PR a little more clearly ;-). Extra credit for the book.

official-stockfish / Stockfish

[RFC] closed positions book. #2646