[RFC] closed positions book.

vondele commented 4 years ago

I have made a pull request to the official book repo with a closed positions book (https://github.com/official-stockfish/books/pull/8). This still needs some testing, but should eventually be available.

I first want to do some testing comparing this to the noob_3moves book on fishtest before we possibly start using this, so that we have a feeling for its quality. My initial impression is rather good.

There are several options we can first discuss here before I decide on this.

Allow patches to be tested against this book, normal STC and LTC. Leave the choice up to the submitter.
Test all patches against this book, i.e. just switch for a couple of weeks.
First retest a couple of patches that were aiming at closed positions but didn't pass.
etc.

MJZ1977 commented 4 years ago

This is a very good idea that I have suggested many times before!

But for me, pe->blockedcount() >= 4 alone is not enough. Many of the positions in the book are not blocked (>80%). Can we add some French and King's Indian positions by hand and remove clearly open positions?

Edit: we can allow patches with this book and test STC non-regression with the initial book.
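To get a feel for what a threshold such as `blockedcount() >= 4` selects, here is a minimal sketch. It assumes a simplified definition (a pawn counts as blocked only when an enemy pawn stands directly in front of it), and the helper name `blocked_pawn_count` is hypothetical; the engine's own count may include more cases.

```python
def blocked_pawn_count(fen: str) -> int:
    """Count pawns of both colours with an enemy pawn directly in front.

    Simplified assumption: the engine's own blocked count may differ.
    """
    board = {}
    ranks = fen.split()[0].split("/")  # FEN lists rank 8 first
    for row, rank_str in enumerate(ranks):
        rank = 8 - row
        file = 0
        for ch in rank_str:
            if ch.isdigit():
                file += int(ch)  # run of empty squares
            else:
                board[(file, rank)] = ch
                file += 1
    blocked = 0
    for (file, rank), piece in board.items():
        if piece == "P" and board.get((file, rank + 1)) == "p":
            blocked += 1  # white pawn stopped by a black pawn
        elif piece == "p" and board.get((file, rank - 1)) == "P":
            blocked += 1  # black pawn stopped by a white pawn
    return blocked


# French Advance structure (d4/e5 against d5/e6) and the start position.
french_advance = "rnbqkbnr/pp3ppp/4p3/2ppP3/3P4/8/PPP2PPP/RNBQKBNR w KQkq - 0 4"
start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(blocked_pawn_count(french_advance), blocked_pawn_count(start))
```

Under this simplified count the French Advance structure only just reaches 4, which illustrates why such a threshold can still admit positions that are not fully closed.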

vondele commented 4 years ago

@MJZ1977 , thanks! Some related observations/notes:

the positions in the book are from before the blocked position is reached, i.e. the position in the game out of which this position is extracted becomes more blocked as SF plays.
I hope that this still allows for some variety, and that improvements will come both from avoiding getting into blocked positions when not advantageous and from playing blocked positions well.
Roughly only 1 out of 50 games currently played on fishtest matched the criterion 'blocked', so this is already 'a massive change' compared to the current state.
Adding by hand is not so easy, I had no way to get the ECO code of a game (fishtest games start from a fen nowadays, not from moves), and one needs ~50k different positions to make a reasonable book. I assume a few people have more advanced tools, and could contribute another book constructed with a different strategy.
Very narrow opening books (e.g. just French) might be a bit risky, overfitting could be lurking there.

NKONSTANTAKIS commented 4 years ago

Thanks for this exciting incentive!

Both strategies should be valid; the specialized one would indeed require a non-regression step. This is a versatile book with a stronger closed position signal, imo safe to use as the normal book. It is probably more universal, due to the heavy underrepresentation of closed positions in the default book. Distribution is evened out with regard to opening type instead of opening availability.

Another point is that for open positions search is a nifty tool, so it's closed positions which need elements.

vondele commented 4 years ago

Influence of the book on Elo difference. noob_3moves.epd vs closedpos.epd. Basically, books have a similar Elo performance, for both SF10 - SF11, as well as SF11 - SFdev.

SF11 vs master (STC)

closed:

ELO: 17.94 +-1.7 (95%) LOS: 100.0%
Total: 60000 W: 13779 L: 10684 D: 35537
Ptnml(0-2): 880, 6085, 13460, 8210, 1365
https://tests.stockfishchess.org/tests/view/5ea415c913fcd4bb2f00a0e4

noob:

ELO: 17.91 +-1.7 (95%) LOS: 100.0%
Total: 60000 W: 13292 L: 10202 D: 36506
Ptnml(0-2): 814, 6166, 13525, 8106, 1389
https://tests.stockfishchess.org/tests/view/5ea415c913fcd4bb2f00a0e4

SF10 vs SF11 (STC):

closed:

ELO: 50.59 +-1.8 (95%) LOS: 100.0%
Total: 60000 W: 17819 L: 9143 D: 33038
Ptnml(0-2): 586, 4917, 12288, 9653, 2556
https://tests.stockfishchess.org/tests/view/5ea413e913fcd4bb2f00a0d3

noob:

ELO: 48.18 +-1.8 (95%) LOS: 100.0%
Total: 60000 W: 17306 L: 9038 D: 33656
Ptnml(0-2): 619, 5006, 12298, 9642, 2435
https://tests.stockfishchess.org/tests/view/5ea415ac13fcd4bb2f00a0e1

SF11 vs master (LTC, Edit: final values)

closed:

ELO: 20.12 +-1.8 (95%) LOS: 100.0%
Total: 40000 W: 7149 L: 4835 D: 28016
Ptnml(0-2): 211, 3221, 11101, 4977, 490 
https://tests.stockfishchess.org/tests/view/5ea45e85b908f6dd28f34ada

noob:

ELO: 17.45 +-1.7 (95%) LOS: 100.0%
Total: 40000 W: 6357 L: 4350 D: 29293
Ptnml(0-2): 224, 3109, 11590, 4590, 487 
https://tests.stockfishchess.org/tests/view/5ea45e72b908f6dd28f34ad7

I think this indicates that the book is pretty general purpose.

I will now reschedule a few of the recent yellow LTCs that presumably target closed positions with the new book
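The Elo figures quoted above can be checked from the W/L/D counts alone. The sketch below uses the standard logistic Elo formula on the average score; the function name `elo_from_wld` is made up for illustration, and the error bars reported by fishtest are not reproduced here.

```python
import math


def elo_from_wld(wins: int, losses: int, draws: int) -> float:
    """Logistic Elo difference implied by the average score per game."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return -400.0 * math.log10(1.0 / score - 1.0)


# W/L/D copied from the SF11 vs master STC runs quoted above.
closed = elo_from_wld(13779, 10684, 35537)
noob = elo_from_wld(13292, 10202, 36506)
print(f"closed: {closed:.2f}  noob: {noob:.2f}")
```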

vondele commented 4 years ago

Can I ask authors of recent yellow LTC patches (e.g. @Vizvezdenec @xoto10 @locutus2 @MJZ1977 @Lolligerhans) that target closed positions to resubmit them LTC, with the new closedpos.epd book, putting closedbook in the info field as well? Looks like a few of them will need rebasing so I can't easily reschedule.

I've rescheduled 2 that were based on current master: https://tests.stockfishchess.org/tests/view/5ea49685b908f6dd28f34b85 https://tests.stockfishchess.org/tests/view/5ea4969ab908f6dd28f34b87

locutus2 commented 4 years ago

I will retest my pawn chain patches with the closed book. I had three similar versions which all passed STC and failed LTC yellow.

Lolligerhans commented 4 years ago

@vondele I had no such patch. I kept track of yellows so I am pretty sure. :)

adentong commented 4 years ago

Unrelated to the current topic, but the last regression test showed only ~11 Elo, while @vondele's LTC tests are showing 18/20 Elo respectively for closed book/noob book. I know we use a different book for regression, but it's still a bit surprising.

xoto10 commented 4 years ago

Very interesting results! Am I right in thinking this book is about the same size as noob_3moves?

So we've used noob_3moves to play a lot of games, then sampled games we're interested in after 8 plies - is that 14 plies from startpos then? That might be a concern for long-term use as the standard book, but given the performance tests give very similar results to noob_3moves, I'm happy to test it out for a couple of weeks. Definitely a plus point to just update the main book instead of having a choice, and having to do non-regression tests against the main book, I just hadn't expected this to be an option. Interesting ...

Vizvezdenec commented 4 years ago

Well, side note: the last RT used a different master that was behind by 2 Elo-gaining patches and one simplification. Also it's kinda expected I guess, with 2 patches interacting with space/blocked positions...

adentong commented 4 years ago

Yea well usually I wouldn't expect a 7-9 elo difference with just two elo gaining patches lol...

NKONSTANTAKIS commented 4 years ago

@adentong RT's use 8_moves book, which has the lowest elo spread (around 10% less). This makes the +50 elo between versions more meaningful. On top of that are the 3 patches, an undefined small effect of book optimization, and double error-bars.

vondele commented 4 years ago

I indeed wouldn't focus too much on the comparison to the RT; it is not exactly the same version of the code, and the 8moves_v3 book is known to yield less Elo difference. The draw rate is slightly different with the books as well: 8moves 0.74, noob_3moves 0.73, closedpos 0.70. This all looks good IMO.
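The draw rates quoted for noob_3moves and closedpos appear consistent with the SF11 vs master LTC results posted earlier in the thread. A minimal check, assuming those are the runs the figures come from (the 8moves figure is not covered, since no W/L/D for it is given here):

```python
def draw_ratio(wins: int, losses: int, draws: int) -> float:
    """Fraction of games that ended in a draw."""
    return draws / (wins + losses + draws)


# (W, L, D) from the SF11 vs master LTC runs quoted earlier.
ltc = {
    "closedpos": (7149, 4835, 28016),
    "noob_3moves": (6357, 4350, 29293),
}
ratios = {book: draw_ratio(*wld) for book, wld in ltc.items()}
for book, ratio in ratios.items():
    print(f"{book}: {ratio:.2f}")
```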

There have been a number of tests overnight using the new book (on old yellow LTCs): https://tests.stockfishchess.org/tests/view/5ea49685b908f6dd28f34b85 https://tests.stockfishchess.org/tests/view/5ea4b95ab908f6dd28f34bde https://tests.stockfishchess.org/tests/view/5ea4a0dcb908f6dd28f34ba4 https://tests.stockfishchess.org/tests/view/5ea4969ab908f6dd28f34b87 https://tests.stockfishchess.org/tests/view/5ea4a14cb908f6dd28f34bab https://tests.stockfishchess.org/tests/view/5ea4a0efb908f6dd28f34ba7 none of them passed, and IIRC one yellow.... probably not too surprising.

So let's get the expectations right. The closedpos book is not a magic bullet, and it will remain a real challenge to get patches passed.

vondele commented 4 years ago

Based on the data collected, my proposal is to switch the default book to closedpos.epd relatively soon, used for essentially all tests (but not RT), and just continue testing as before. In particular, after passed STC and LTC tests on closedpos, PRs can be made, no need for additional non-regression tests. After a couple of weeks (June?) this strategy is reassessed.

Give thumbs up or down if you agree or disagree with this proposal.

locutus2 commented 4 years ago

@vondele I would prefer more to do a non-regression against noob book but more in the sense of monitoring to be alarmed if it goes really bad. Here we can probably use weaker bounds like [-2;0].

But the best approach seems to me to be a mixed book: 50% positions from the closed book and 50% positions from the noob book. So we would have the best of both worlds: closed position testing but no overfitting to this type of position IMO.

vondele commented 4 years ago

@locutus2 I plan to do the monitoring based on the usual 8moves RT runs.

My argument against doing additional non-regression tests is that I want to keep our procedure as simple as possible. I'm also pretty confident that regressions are unlikely. But if there is a strong feeling in favor of the additional testing on passed patches, I'm fine with it. So, let's see what the vibes are.

I'm not in favor of mixing the books. Let's try to get a clean signal. Again, the book is not extreme, and there will be opinions going in either direction (e.g. @MJZ1977 would like to see it more closed, you prefer a little more open).

locutus2 commented 4 years ago

@vondele About the clean signal point: OK, I understand that from a scientific standpoint it is good to get clean data about the closed book to assess it (here I'm with you). But it's important how we go on from there. Say the closed book seems good: do we then take this further, or mix it with, for example, the noob book (which has also worked until now)? Only the second option seems to avoid biased development, and I think it is not good to go from one extreme (unusually open positions) to another (nearly closed positions), so mixing seems the best approach.

vondele commented 4 years ago

@locutus2 long term I can indeed see the point, and we can reassess.

Short term, let's figure out if the book actually matters much. I think this is an experiment to try and see if the perceived weakness in closed positions can actually be more easily fixed with a closed book (if one looks at the positions, it really is not that closed). We might find that this is not as important as we think.

This is in part an old discussion, the many years of development with the 2moves book, which really was not very sophisticated, illustrated that the book might not be the key ingredient to progress.

MJZ1977 commented 4 years ago

I think we can keep the 2 books for the moment and change the default once we have a clear picture. It will be interesting to find a patch that shows a big gap between the 2 books: green with the "closed book" and red with the "noob book". Then we can conclude.

xoto10 commented 4 years ago

Last night I was thinking this was a big development ... now seeing the results of the reruns, it seems it doesn't make much difference at all. Perhaps there is a subtle change that we will become aware of over time. At the moment (very early of course), it seems the lower draw rate is perhaps the main change (benefit?) of this.

My main concern if we switch to using this book for the medium term remains the beginning of the game. If we want sf to get better at the early moves, surely we need a test book that includes small ply openings (say 0-5) as well as longer ones?

miguel-l commented 4 years ago

The way I understand it, we get positions from games in which Stockfish goes on to close the position (please correct me if I misunderstood something). But what about games that Stockfish fails to close the position? For example, when searching from root, very commonly we see the exchange French, etc. Something feels off about it.

NKONSTANTAKIS commented 4 years ago

I believe that the beginning of the game is too vague to be helped by eval, due to very high availability of viable options and different setups. But as the midgame eval becomes more accurate, it will show at openings via better steering of search.

This book should not be regarded as a specialized closed position book, but as an attempt for a more balanced general book in regards to position type. The conditioning is soft and leads to open positions too. The problem with typical books is that they are balanced in regards to viable opening availability, thus tiny signal of truly closed positions. SF has problem with those for 3 reasons:

Rarity of occurrence, as explained
Vastly different characteristics
Inefficiency of search (due to their long-term nature, where 1 pawn move can ruin the prospects forever, entering a distant dead-end)

Search inefficiency (and unfortunate setup selection) has partly to do with seeking generically favorable evals: A highly valued bonus in a static position acts like a black hole for the search. It sucks up all the resources in that direction, because it "believes" it's something supreme, blinding it to alternatives. An example is a very deep knight outpost on a totally blocked flank + space advantage. Totally useless at a glance for chess players, but SF aims for it even from the early game.

Removing those black-holes completely will require "alien" tech like pattern-recognition, MCTS, NN, or a detailed categorization of cases. But an increased representation of black-hole situations will surely boost long-term health.

I don't believe SF needs training at positions that are very easy for it, nor is it in danger of regressing. At tactical cases the various paths are narrow and concrete and search shines.

xoto10 commented 4 years ago

But what about games that Stockfish fails to close the position?

Good question. I guess there will be a few d4/e5 French advance structures in this book, perhaps this can be an iterative process and the book can be recreated occasionally? If we can improve sf's blocked position play a little, then it will choose more blocked positions ... then we can improve its play a little more ... etc

Edit: or we could just get some games from somewhere else, no reason to only use fishtest? e.g. http://data.lczero.org/files/match_pgns/1/

vondele commented 4 years ago

I believe there have been some valid concerns raised in this thread, enough so that we should consider alternatives. I have now built a new book with a very different approach based on these comments. I'll again do some testing on fishtest later. The major concerns I have seen raised are:

balance between closed and open lines (e.g. closedpos.epd vs noob_3moves.epd)
need for short lines (2moves, noob_3moves)
need for long lines (8moves)
presence of particular openings like french advance, KID, etc (8moves)
absence of 'strange/rare openings' (2moves, noob_*)
Elo resolution

To address this, I made a book based on the frequency of FENs in games played at lichess (restricted to Elo > 1800, TC > 60). I retained the 200k most frequent FENs out of >8M games. (see https://github.com/official-stockfish/books/pull/9)

This has the following advantages:

lines closed and open are balanced, reflecting human choice
short lines are present (e.g. startpos is the most frequent position)
long lines are present (i.e. popular deep lines are played relatively often).
has all named openings
'strange/rare' openings are absent or a very small fraction (e.g. no grob in the top 200'000)
Elo resolution needs to be measured on fishtest.

Of course, the choice of the initial database will somewhat influence the resulting FENs, but I think that's more or less secondary.
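The frequency-based selection described above can be sketched as follows. This is not the script used for the pull request; the function name `most_frequent_positions` is hypothetical, and stripping the move counters from each FEN is an assumption about how positions would be deduplicated.

```python
from collections import Counter
from typing import Iterable, List


def most_frequent_positions(fens: Iterable[str], keep: int) -> List[str]:
    """Return the `keep` most frequent positions, most common first."""
    # Assumption: drop the halfmove/fullmove counters so the same position
    # reached at different move numbers is counted once.
    counts = Counter(" ".join(fen.split()[:4]) for fen in fens)
    return [pos for pos, _ in counts.most_common(keep)]


# Toy input: in practice the FENs would be extracted from the game database.
start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
e4 = "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1"
d4 = "rnbqkbnr/pppppppp/8/8/3P4/8/PPPP1PPP/RNBQKBNR b KQkq d3 0 1"
top = most_frequent_positions([start, start, start, e4, e4, d4], keep=2)
print(top)
```

With real data, `keep` would be the 200k mentioned above, and the start position would naturally come out first.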

Edit: the Elo testing yielded the following:

SF11 -> master (STC)
ELO: 11.89 +-1.6 (95%) LOS: 100.0%
Total: 60000 W: 13791 L: 11738 D: 34471
Ptnml(0-2): 763, 6016, 14647, 7553, 1021 
https://tests.stockfishchess.org/tests/view/5ea7e0a953a4548a0348ecb1

SF11 -> master (LTC)
ELO: 14.61 +-1.6 (95%) LOS: 100.0%
Total: 40000 W: 7331 L: 5650 D: 27019
Ptnml(0-2): 181, 3045, 11987, 4486, 301 
https://tests.stockfishchess.org/tests/view/5ea7e0d653a4548a0348ecb5

SF10 -> SF11 (STC)
ELO: 43.35 +-1.7 (95%) LOS: 100.0%
Total: 60000 W: 17566 L: 10119 D: 32315
Ptnml(0-2): 531, 4776, 13411, 9279, 2003 
https://tests.stockfishchess.org/tests/view/5ea7e0c353a4548a0348ecb3

So the Elo spread is somewhat small on this book.

Does anybody have a pointer to another PGN database of high quality games (e.g. master level, ICCF)? It will need to be > 2M games to be suitable to build a book, I would say.

Alternatively, a subset of high quality leela training games (again >2M) ?

xoto10 commented 4 years ago

noob_2/3moves books were selected to avoid drawish openings IIRC, but the closedpos book just turned out to have a good Elo spread without any explicit drawish checks. (I wonder why?)

Do you have any info on how many of these popularpos lines qualify as closed under the closedpos tests? Maybe we need a not-drawish test if we want to consider these popular and more open lines?

vdbergh commented 4 years ago

noob_2/3moves books were selected to avoid drawish openings IIRC,

No they were not. In fact their draw ratio is rather high. Note: for the same Elo you want the highest possible draw ratio (= least amount of noise). If you want to lower the draw ratio convert every draw into a win or loss using a coin.
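The coin-flip remark can be made concrete with a small calculation. The sketch below assumes a simple per-game trinomial model (fishtest itself works with game pairs), and the win/draw/loss probabilities are made-up illustrative values. Replacing every draw by a fair coin flip leaves the expected score unchanged but adds a quarter of the draw probability to the per-game variance.

```python
from typing import Tuple


def score_and_variance(p_win: float, p_draw: float, p_loss: float) -> Tuple[float, float]:
    """Mean and variance of the per-game score (win=1, draw=0.5, loss=0)."""
    score = p_win + 0.5 * p_draw
    variance = p_win + 0.25 * p_draw - score ** 2
    return score, variance


def coin_flip_draws(p_win: float, p_draw: float, p_loss: float) -> Tuple[float, float, float]:
    """Replace every draw by a fair coin flip between win and loss."""
    return p_win + 0.5 * p_draw, 0.0, p_loss + 0.5 * p_draw


# Illustrative probabilities only (not taken from any test above).
original = (0.22, 0.61, 0.17)
s0, v0 = score_and_variance(*original)
s1, v1 = score_and_variance(*coin_flip_draws(*original))
print(f"score {s0:.3f} -> {s1:.3f}, variance {v0:.4f} -> {v1:.4f}")
```

Same signal, more noise: this is why a lower draw ratio on its own is no benefit.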

vondele commented 4 years ago

I ran a second test on a book popularpos_lichess_v2.epd which was constructed retaining games from >2200 Elo players only. The result, however, is nearly identical:

ELO: 43.41 +-1.7 (95%) LOS: 100.0%
Total: 59896 W: 16875 L: 9430 D: 33591
Ptnml(0-2): 492, 4789, 13408, 9300, 1959 
https://tests.stockfishchess.org/tests/view/5eab03cb09d25e8e5058169b

the noob_3moves book was not selected specifically to avoid drawish openings, but it might be a side effect of how the database has been constructed.

noobpwnftw commented 4 years ago

My books were built from one simple rule: pick moves that are top N and not worse than a score threshold. I find it interesting that the result converges with a book built with human games.
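As a rough illustration of such a rule, here is a minimal sketch. The function name, the parameters `top_n` and `min_score`, the centipawn scores, and the reading of "score threshold" as an absolute cutoff are all assumptions for the example, not details of how the noob books were actually generated.

```python
from typing import Dict, List


def select_book_moves(scores: Dict[str, int], top_n: int, min_score: int) -> List[str]:
    """Keep at most `top_n` best moves whose score is at least `min_score`."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [move for move, score in ranked[:top_n] if score >= min_score]


# Made-up scores (centipawns, from the side to move) for the start position.
candidates = {"e2e4": 35, "d2d4": 30, "g1f3": 25, "c2c4": 28, "g2g4": -90}
print(select_book_moves(candidates, top_n=4, min_score=-20))
print(select_book_moves(candidates, top_n=5, min_score=-20))
```

The second call shows the threshold at work: widening N does not let a clearly bad move into the book.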

vondele commented 4 years ago

I did a quick analysis (depth 13) of the score of the book moves, and that highlights quite some difference between the 2 classes of books (figure: opening_book_score). Basically, the human games, even in these 'popular positions' have a much broader range of scores, i.e. essentially won or lost. This improves only very little with Elo of the players. I think the main problem is that these human games are mostly very short TC (>60s, but typically 180s). So, if anybody has a clean database of long TC games between good players...

vdbergh commented 4 years ago

the human games, even in these 'popular positions' have a much broader range of scores, i.e. essentially won or lost

Yes the RMS bias is around 90. See https://tests.stockfishchess.org/tests/stats/5eab03cb09d25e8e5058169b . Comparable to the 8moves book which is also derived from human openings IIRC.

vdbergh commented 4 years ago

Sorry I misremembered. The RMS bias of the 8moves book is around 60. It was the 2moves book that had an RMS bias around 90 (showing that biased openings are not necessarily bad).

By comparison the RMS bias of the noob_3moves book is around 30.

NKONSTANTAKIS commented 4 years ago

1 node Leela is around 2500 elo (on big SV nets - dense knowledge). How about trying a book based on eval divergence to SF search ply x? It should be very rich of SF blindspots.

vondele commented 4 years ago

So, I tried a 3rd book based on 'popular positions', namely ranking them with their frequency on lichess times the frequency on fishtest (LTC games based on noob books only). The result is human style positions that stockfish would play as well. I think it is a nice book, with mostly openings found in master level games as well, no blunders, lots of nice opening lines etc (@MJZ1977 you might want to check). It has book move scores very similar to those of noob books (figure: opening_book_score).

yet, the Elo spread remains low, in fact very similar to the previous two versions:

ELO: 43.77 +-1.8 (95%) LOS: 100.0%
Total: 59940 W: 16150 L: 8638 D: 35152
Ptnml(0-2): 590, 4919, 12901, 9509, 2051 
https://tests.stockfishchess.org/tests/view/5ead5bab6ffeed51f6e3257e

So, good Elo spread seems to be a different property..
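The ranking used for this third book (lichess frequency times fishtest frequency) can be sketched like this. The function name and the toy position labels are hypothetical, and normalising counts to relative frequencies is an assumption; it does not change the ordering.

```python
from collections import Counter
from typing import List


def rank_by_joint_frequency(lichess: Counter, fishtest: Counter, keep: int) -> List[str]:
    """Rank positions by the product of their relative frequencies."""
    total_l = sum(lichess.values())
    total_f = sum(fishtest.values())
    # A position missing from either source has product zero, so only the
    # intersection can make it into the book.
    common = lichess.keys() & fishtest.keys()
    weight = {
        pos: (lichess[pos] / total_l) * (fishtest[pos] / total_f)
        for pos in common
    }
    return sorted(weight, key=weight.get, reverse=True)[:keep]


# Toy counts with placeholder labels instead of FENs.
lichess_counts = Counter({"A": 50, "B": 30, "C": 20})
fishtest_counts = Counter({"A": 10, "B": 60, "D": 30})
ranking = rank_by_joint_frequency(lichess_counts, fishtest_counts, keep=10)
print(ranking)
```

Taking the product means a position must be popular with humans and also reachable by the engine to rank highly, which matches the "human style positions that stockfish would play as well" description.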

vdbergh commented 4 years ago

@vondele On top of that all your books seem to have lower draw ratio than noob_3moves.epd. So for SPRT they would be even less efficient than what the Elo spread suggests....

Draw ratio, RMS bias and Elo spread seem to be independent properties of a book which seem to be impossible to predict. Very strange.

NKONSTANTAKIS commented 4 years ago

Imo Elo spread is the most important and closely linked to the sustained (thus mainly positional) complexity of a position, providing room for outplay. Initial complexity with tendency to suddenly resolve to any outcome might reduce draw-rate but with high randomness.

vondele commented 4 years ago

@NKONSTANTAKIS how does that sustained complexity assumption match the great Elo spread of the noob_3moves book ?

Edit: many more complex positions in the popular pos v3 book.

noobpwnftw commented 4 years ago

I have a few theories, in order of likelihood: 1) My leaf scores are more accurate, and I fiddled with the back propagation of scores, based on weighted averaging of the top N moves within a certain range. 2) No selection bias for what move to play (human games have a strong preference from opening theory). 3) I cut lines that had bad moves from both sides, even though the position ended up being balanced.

NKONSTANTAKIS commented 4 years ago

@vondele I would not say great, but good elo spread. noob_4moves has a bit higher and drawkiller the highest so far counted. I think this is mainly due to the max eval filtering of 1-sided position tendencies, which favor the weaker engine. If the presented problem is simple enough for a 3200 elo to solve, how to differentiate to a 3400 elo? Basically @noobpwnftw's point 1, even more powerful.

Another filtering which would help is of equal but dry positions, which are low on randomness but also low on sensitivity. Those would probably mean that usually one side has to blunder for the game not to be a draw. These kinds of positions might be suitable for lower TC, and lose value as quality rises. But how to define "dry"? From a chess perspective it is hard to apply to a database, but I think a pretty accurate signal would be the % of draws in results, especially when it rises disproportionately with TC.

The closed book should have lower % of this kind, having higher elo spread at LTC.

I explain the highest spread of drawkiller to the artificial asymmetric re-arrangement of pieces + the existence of all pawns. Pawns increase the long-term strategic complexity, as irreversible decisions.

I have 2 ideas as propositions for elo-spread testing:

Condition for existence of all 16 pawns at noob_3moves or better noob_4moves.
Pure 960 bookless, preferably with sequential (non-random) distribution.

Drawkiller might be a bit dangerous, as too specifically lopsided, but 960 from the get go should help a wide and unbiased understanding of chess. When the openings are too similar it creates overfitting, while 960 might create underfitting, with lowered correlation to normal chess. An interesting experiment nonetheless.

noobpwnftw commented 4 years ago

If I'm to make drawkiller-alike books, I might take a set of positions and for each position, play a dozen master vs master games and remove the ones that have high draw rate, until I get a set of positions that satisfies certain criteria. Whether such books are beneficial however is another question.

The real question is how to come up with the initial set of positions that are truly neutral and more meaningful than random play or FRC, not how to introduce whatever specific selection bias you'd like on top of that.

NKONSTANTAKIS commented 4 years ago

Neutral is good but it also has to be meaningful...dry is bad. Just lowering draw rate would probably worsen elo spread, but "drawkiller" (a misleading name) improves it by aiming for strategic complexity.

I think for drawishness there should be a threshold: 50% draw might be too little, but 90% too high. The 1st signal will be strong and noisy, the 2nd accurate but weak. Your method seems to have taken good care of the one-sided part of extremes, how about targeting the other by cutting out like the top 10% of draw dominance?

vondele commented 4 years ago

@noobpwnftw in principle we have most of the data. There are millions of LTC games for the noob_3moves book, so in principle ~1000 games per position in the book. Not truly master vs master, but close. So, one could start such an analysis. Not sure that one needs to avoid a high drawrate (see comment by @vdbergh above). I guess one needs a high drawrate, and a large difference in 1-0 / 0 - 1 outcome.
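A per-position analysis of this kind could start from a tally like the one below. This is only a sketch under one possible reading of the criteria: it reports the draw rate and the gap between 1-0 and 0-1 results for each opening. The function name and the input format (pairs of opening FEN and PGN result string) are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_position_stats(games: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Aggregate results per opening; result is '1-0', '0-1' or '1/2-1/2'."""
    tally = defaultdict(lambda: {"1-0": 0, "0-1": 0, "1/2-1/2": 0})
    for fen, result in games:
        tally[fen][result] += 1
    stats = {}
    for fen, counts in tally.items():
        n = sum(counts.values())
        stats[fen] = {
            "games": n,
            "draw_rate": counts["1/2-1/2"] / n,
            # Positive when White wins more often than Black from here.
            "white_minus_black": (counts["1-0"] - counts["0-1"]) / n,
        }
    return stats


# Toy data with placeholder labels instead of FENs.
toy_games = [("A", "1-0"), ("A", "1/2-1/2"), ("A", "1/2-1/2"), ("A", "0-1"),
             ("B", "1-0"), ("B", "1-0")]
stats = per_position_stats(toy_games)
print(stats)
```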

noobpwnftw commented 4 years ago

The engine has to know how to draw those positions, this is also a part of their performance. I think the misconception is that people want to remove such seemingly trivial positions to increase elo resolution, this doesn't work. Those positions are the most common ones and they are only a few moves in, a lot of things can happen that contributes to the final outcome.

noobpwnftw commented 4 years ago

OTOH, I think a good filter for whether a position is "complex" or "closed" is time to depth(aka. pv stability). Would be interesting to see how that results.

NKONSTANTAKIS commented 4 years ago

Something trivial at high depth might indeed not be trivial at low depth. So results vary with TC. But if a position is too forgiving, it's a problem. Forced lines that lead to dry simplifications can't be productive.

vondele commented 4 years ago

@noobpwnftw time to depth was a criterion for the closedpos book in some sense (go perft 5 being a small number).

noobpwnftw commented 4 years ago

I mean what if we build a book that contains only positions which the search is taking longer to reach certain depth, those positions should be quite complicated and might give good performance resolution. Raw perft does not reflect pruning behavior, I think closed positions have a lot of moves that cannot be trivially pruned.
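A "nodes to reach depth N" measurement of this kind can be read off an engine's UCI output. The sketch below only parses text; the sample lines are made up for illustration and are not real engine output, and the helper name `nodes_at_depth` is hypothetical. It assumes the usual `info depth ... nodes ...` layout and skips aspiration-window lines marked `lowerbound` or `upperbound`.

```python
from typing import Iterable, Optional


def nodes_at_depth(uci_lines: Iterable[str], depth: int) -> Optional[int]:
    """Node count from the first completed `info` line reporting `depth`."""
    for line in uci_lines:
        tokens = line.split()
        if tokens[:1] != ["info"] or tokens[1:2] == ["string"]:
            continue  # not a search info line
        if "depth" not in tokens or "nodes" not in tokens:
            continue
        if "lowerbound" in tokens or "upperbound" in tokens:
            continue  # iteration not finished yet
        if int(tokens[tokens.index("depth") + 1]) == depth:
            return int(tokens[tokens.index("nodes") + 1])
    return None


# Made-up sample output for illustration.
sample = [
    "info string made-up sample, not real engine output",
    "info depth 12 seldepth 16 multipv 1 score cp 31 nodes 52000 nps 900000 time 57 pv e2e4 e7e5",
    "info depth 13 seldepth 18 multipv 1 score cp 29 lowerbound nodes 70000 nps 900000 time 77 pv e2e4",
    "info depth 13 seldepth 18 multipv 1 score cp 33 nodes 84000 nps 900000 time 93 pv e2e4 e7e5",
    "bestmove e2e4 ponder e7e5",
]
print(nodes_at_depth(sample, 13))
```

A book filter along the lines suggested above would then keep only positions whose node count exceeds some chosen cutoff.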

vondele commented 4 years ago

I actually have those numbers... let me check.

vondele commented 4 years ago

so average number nodes needed to reach depth 13:

book	nodes
noob_3moves	81385
closedpos	123145
popularpos	113054
popularpos_v2	111785
popularpos_v3	115037

noobpwnftw commented 4 years ago

Weird, so the theory is right, but the result went the opposite...

dorzechowski commented 4 years ago

Out of curiosity I checked depth 13 nodes in the 2moves_v2 book. The book is relatively small (12k positions) so I analyzed the whole book. The average is 134673 and the histogram looks like this (figure: 2moves_v2_depth13_nodes_histogram).

The scatter plot of perft 5 nodes vs depth 13 nodes looks like below (figure: nodes_d13_p5_2moves). There is no correlation at all (R=0.14).

Position with max depth 13 nodes (385505): rnbqkbnr/p1pp1ppp/1p2p3/8/3P4/4P3/PPP2PPP/RNBQKBNR w KQkq -

Position with min depth 13 nodes (28154): rnbqkbnr/p1pp1ppp/4p3/1p6/5P2/2N5/PPPPP1PP/R1BQKBNR w KQkq -

All with latest SF (2 May 2020).
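For anyone wanting to repeat the perft 5 vs depth 13 comparison on another book, the correlation coefficient can be computed as below. This is a plain Pearson correlation written out by hand; the toy numbers are made up and only demonstrate the function, they are not measurements.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up toy values: one (perft 5 nodes, depth 13 nodes) pair per position.
perft5 = [1.0, 2.0, 3.0, 4.0]
depth13_linear = [10.0, 20.0, 30.0, 40.0]
depth13_reversed = [40.0, 30.0, 20.0, 10.0]
print(pearson_r(perft5, depth13_linear), pearson_r(perft5, depth13_reversed))
```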

official-stockfish / Stockfish

[RFC] closed positions book. #2646