Closed vondele closed 4 years ago
This is a very good idea, one I have suggested many times before!
For me, though, pe->blockedcount() >= 4 is not enough: many of the positions in the book (>80%) are not blocked. Can we add some French and King's Indian positions by hand and remove the clearly open positions?
Edit: we could allow patches to be tested with this book, and then run an STC non-regression test with the initial book.
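To make the blocked-pawn criterion concrete, here is a minimal sketch that counts blocked pawns directly from a FEN string. It is a simplification and an assumption: it only counts pawns that directly face an enemy pawn, which is not necessarily how Stockfish computes its own blocked count, and the function name is hypothetical.

```python
def blocked_pawn_count(fen: str) -> int:
    """Count pawns of both colours that directly face an enemy pawn.

    Simplified stand-in for a blocked-pawn measure (an assumption, not
    Stockfish's exact definition).
    """
    rows = fen.split()[0].split("/")  # rank 8 comes first in a FEN
    board = []
    for row in rows:
        squares = []
        for ch in row:
            if ch.isdigit():
                squares.extend("." * int(ch))  # expand runs of empty squares
            else:
                squares.append(ch)
        board.append(squares)

    count = 0
    for r in range(1, 8):
        for f in range(8):
            # A white pawn on board[r][f] advances to board[r - 1][f].
            if board[r][f] == "P" and board[r - 1][f] == "p":
                count += 2  # the white pawn and the black pawn block each other
    return count


# French Advance after 1.e4 e6 2.d4 d5 3.e5: d4/d5 and e5/e6 are blocked pairs.
french_advance = "rnbqkbnr/ppp2ppp/4p3/3pP3/3P4/8/PPP2PPP/RNBQKBNR b KQkq - 0 3"
print(blocked_pawn_count(french_advance))
```

Under this simplified count the French Advance structure just reaches a threshold of 4, while the starting position scores 0.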
@MJZ1977 , thanks! Some related observations/notes:
Thanks for this exciting incentive!
Both strategies should be valid; the specialized one would indeed require a non-regression step. This is a versatile book with a stronger closed-position signal, imo safe to use as the normal book. It is probably more universal, given how heavily closed positions are underrepresented in the default book. The distribution is evened out with regard to opening type instead of opening availability.
Another point is that for open positions search is a nifty tool, so it's closed positions that need elements.
Influence of the book on Elo difference. noob_3moves.epd vs closedpos.epd. Basically, books have a similar Elo performance, for both SF10 - SF11, as well as SF11 - SFdev.
closed:
ELO: 17.94 +-1.7 (95%) LOS: 100.0%
Total: 60000 W: 13779 L: 10684 D: 35537
Ptnml(0-2): 880, 6085, 13460, 8210, 1365
https://tests.stockfishchess.org/tests/view/5ea415c913fcd4bb2f00a0e4
noob:
ELO: 17.91 +-1.7 (95%) LOS: 100.0%
Total: 60000 W: 13292 L: 10202 D: 36506
Ptnml(0-2): 814, 6166, 13525, 8106, 1389
https://tests.stockfishchess.org/tests/view/5ea415c913fcd4bb2f00a0e4
closed:
ELO: 50.59 +-1.8 (95%) LOS: 100.0%
Total: 60000 W: 17819 L: 9143 D: 33038
Ptnml(0-2): 586, 4917, 12288, 9653, 2556
https://tests.stockfishchess.org/tests/view/5ea413e913fcd4bb2f00a0d3
noob:
ELO: 48.18 +-1.8 (95%) LOS: 100.0%
Total: 60000 W: 17306 L: 9038 D: 33656
Ptnml(0-2): 619, 5006, 12298, 9642, 2435
https://tests.stockfishchess.org/tests/view/5ea415ac13fcd4bb2f00a0e1
closed:
ELO: 20.12 +-1.8 (95%) LOS: 100.0%
Total: 40000 W: 7149 L: 4835 D: 28016
Ptnml(0-2): 211, 3221, 11101, 4977, 490
https://tests.stockfishchess.org/tests/view/5ea45e85b908f6dd28f34ada
noob:
ELO: 17.45 +-1.7 (95%) LOS: 100.0%
Total: 40000 W: 6357 L: 4350 D: 29293
Ptnml(0-2): 224, 3109, 11590, 4590, 487
https://tests.stockfishchess.org/tests/view/5ea45e72b908f6dd28f34ad7
I think this indicates that the book is pretty general purpose.
I will now reschedule a few of the recent yellow LTCs that presumably target closed positions with the new book
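For readers who want to check the numbers, the following sketch recomputes an Elo estimate from the W/L/D line and a 95% error bar from the Ptnml line. It assumes the standard logistic Elo formula and a normal approximation on game-pair scores; fishtest's own code may differ in detail, and the function names are hypothetical.

```python
import math


def elo_from_score(score: float) -> float:
    """Logistic Elo difference for an expected score strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_from_wld(wins: int, losses: int, draws: int) -> float:
    games = wins + losses + draws
    return elo_from_score((wins + 0.5 * draws) / games)


def elo_error_from_ptnml(ptnml, z: float = 1.96) -> float:
    """Half-width of the confidence interval, using game pairs as samples.

    ptnml holds the counts of pairs scoring 0, 0.5, 1, 1.5 and 2 points.
    """
    pairs = sum(ptnml)
    per_game = [0.0, 0.25, 0.5, 0.75, 1.0]  # pair score divided by 2
    mean = sum(x * c for x, c in zip(per_game, ptnml)) / pairs
    var = sum(c * (x - mean) ** 2 for x, c in zip(per_game, ptnml)) / pairs
    stderr = math.sqrt(var / pairs)
    return (elo_from_score(mean + z * stderr) - elo_from_score(mean - z * stderr)) / 2


# First closedpos result quoted above.
print(round(elo_from_wld(13779, 10684, 35537), 2))
print(round(elo_error_from_ptnml([880, 6085, 13460, 8210, 1365]), 1))
```

For the first closedpos result this gives roughly 17.9 and +-1.7, in line with the reported figures.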
Can I ask authors of recent yellow LTC patches (e.g. @Vizvezdenec @xoto10 @locutus2 @MJZ1977 @Lolligerhans) that target closed positions to resubmit them LTC, with the new closedpos.epd book, putting closedbook in the info field as well? Looks like a few of them will need rebasing so I can't easily reschedule.
I've rescheduled 2 that were based on current master: https://tests.stockfishchess.org/tests/view/5ea49685b908f6dd28f34b85 https://tests.stockfishchess.org/tests/view/5ea4969ab908f6dd28f34b87
I will retest my pawn chain patches with the closed book. I had three similar versions, which all passed STC and failed LTC yellow.
@vondele I had no such patch. I kept track of yellows so I am pretty sure. :)
Unrelated to the current topic, but the last regression test showed only ~11 Elo, while @vondele's LTC tests are showing 18/20 Elo respectively for closed book/noob book. I know we use a different book for regression, but still a bit surprising.
Very interesting results! Am I right in thinking this book is about the same size as noob_3moves?
So we've used noob_3moves to play a lot of games, then sampled games we're interested in after 8 plies - is that 14 plies from startpos then? That might be a concern for long-term use as the standard book, but given the performance tests give very similar results to noob_3moves, I'm happy to test it out for a couple of weeks. Definitely a plus point to just update the main book instead of having a choice, and having to do non-regression tests against the main book, I just hadn't expected this to be an option. Interesting ...
well side note that last RT has different master that was behind by 2 elo patches and one simplification. Also it's kinda expected I guess with 2 space/blocked positions interacting patches...
Yea well usually I wouldn't expect a 7-9 elo difference with just two elo gaining patches lol...
@adentong RT's use 8_moves book, which has the lowest elo spread (around 10% less). This makes the +50 elo between versions more meaningful. On top of that are the 3 patches, an undefined small effect of book optimization, and double error-bars.
I indeed wouldn't focus too much on the comparison to the RT: it is not exactly the same version of the code, and the 8moves_v3 book is known to yield less Elo difference. The draw rate is slightly different between the books as well: 8moves 0.74, noob_3moves 0.73, closedpos 0.70. This all looks good IMO.
There have been a number of tests overnight using the new book (on old yellow LTCs): https://tests.stockfishchess.org/tests/view/5ea49685b908f6dd28f34b85 https://tests.stockfishchess.org/tests/view/5ea4b95ab908f6dd28f34bde https://tests.stockfishchess.org/tests/view/5ea4a0dcb908f6dd28f34ba4 https://tests.stockfishchess.org/tests/view/5ea4969ab908f6dd28f34b87 https://tests.stockfishchess.org/tests/view/5ea4a14cb908f6dd28f34bab https://tests.stockfishchess.org/tests/view/5ea4a0efb908f6dd28f34ba7 none of them passed, and IIRC one yellow.... probably not too surprising.
So let's get the expectations right. The closedpos book is not a magic bullet, and it will remain a real challenge to get patches passed.
Based on the data collected, my proposal is to switch the default book to closedpos.epd relatively soon, used for essentially all tests (but not RT), and just continue testing as before. In particular, after passed STC and LTC tests on closedpos, PRs can be made, no need for additional non-regression tests. After a couple of weeks (June?) this strategy is reassessed.
Give thumbs up or down if you agree or disagree with this proposal.
@vondele I would prefer to do a non-regression test against the noob book, but more as monitoring, so that we are alerted if things go really badly. Here we can probably use weaker bounds like [-2;0].
But the best approach, it seems to me, is a mixed book: 50% positions from the closed book and 50% from the noob book. That way we would have the best of both worlds: closed-position testing but no overfitting to this type of position, IMO.
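A 50/50 mixed book as proposed here could be assembled along the following lines. This is only a sketch: the function name is hypothetical, the books are assumed to be already loaded as lists of EPD lines, and uniform sampling without replacement is an assumption.

```python
import random


def mix_books(book_a, book_b, size, seed=0):
    """Draw half of `size` positions from each book and shuffle the result.

    book_a and book_b are lists of EPD lines; the seed makes the mix
    reproducible.
    """
    rng = random.Random(seed)
    half = size // 2
    picked = rng.sample(book_a, half) + rng.sample(book_b, size - half)
    rng.shuffle(picked)
    return picked


# Placeholder entries stand in for real EPD lines.
closed = [f"closed-{i}" for i in range(10)]
noob = [f"noob-{i}" for i in range(10)]
print(mix_books(closed, noob, 6))
```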
@locutus2 I plan to do the monitoring based on the usual 8moves RT runs.
My argument against doing additional non-regression tests is that I want to keep our procedure as simple as possible. I'm also pretty confident that regressions are unlikely. But if there is a strong feeling in favor of the additional testing on passed patches, I'm fine with it. So, let's see what the vibes are.
I'm not in favor of mixing the books. Let's try to get a clean signal. Again, the book is not extreme, and there will be opinions going in either direction (e.g. @MJZ1977 would like to see it more closed, you prefer a little more open).
@vondele About the clean signal point: OK, I understand that from a scientific standpoint it is good to get clean data about the closed book in order to assess it (here I'm with you). But what matters is how we go on from there. Say the closed book seems good: do we then take it further, or mix it with, for example, the noob book (which has also worked until now)? Only the second option seems to avoid biased development, and I think it is not good to go from one extreme (unusual open positions) to another (nearly closed positions), so mixing seems the best approach.
@locutus2 long term I can indeed see the point, and we can reassess.
Short term, let's figure out if the book actually matters much. I think this is an experiment to try and see if the perceived weakness in closed positions can actually be more easily fixed with a closed book (if one looks at the positions, it really is not that closed). We might find that this is not as important as we think.
This is in part an old discussion, the many years of development with the 2moves book, which really was not very sophisticated, illustrated that the book might not be the key ingredient to progress.
I think we can keep the 2 books for now and change the default once things are clear. It will be interesting to find a patch that shows a big gap between the 2 books: green with the "closed book" and red with the "noob book". Then we can conclude.
Last night I was thinking this was a big development ... now seeing the results of the reruns, it seems it doesn't make much difference at all. Perhaps there is a subtle change that we will become aware of over time. At the moment (very early of course), it seems the lower draw rate is perhaps the main change (benefit?) of this.
My main concern if we switch to using this book for the medium term remains the beginning of the game. If we want sf to get better at the early moves, surely we need a test book that includes small ply openings (say 0-5) as well as longer ones?
The way I understand it is that we get positions which, in its games Stockfish closes the position (please correct me if I misunderstood something). But what about games that Stockfish fails to close the position? For example, when searching from root, very commonly we see the exchange French, etc. Something feels off about it.
I believe that the beginning of the game is too vague to be helped by eval, due to very high availability of viable options and different setups. But as the midgame eval becomes more accurate, it will show at openings via better steering of search.
This book should not be regarded as a specialized closed position book, but as an attempt for a more balanced general book in regards to position type. The conditioning is soft and leads to open positions too. The problem with typical books is that they are balanced in regards to viable opening availability, thus tiny signal of truly closed positions. SF has problem with those for 3 reasons:
Search inefficiency (and unfortunate setup selection) has partly to do with seeking generically favorable evals: a highly valued bonus in a static position acts like a black hole for the search. It sucks up all the resources in that direction, because the search "believes" it's something supreme, blinding it to alternatives. An example is a very deep knight outpost on a totally blocked flank + space advantage. Totally useless at a glance for chess players, but SF aims for it even from the early game.
Removing those black-holes completely will require "alien" tech like pattern-recognition, MCTS, NN, or a detailed categorization of cases. But an increased representation of black-hole situations will surely boost long-term health.
I don't believe SF needs training at positions that are very easy for it, nor is it in danger of regressing. At tactical cases the various paths are narrow and concrete and search shines.
> But what about games that Stockfish fails to close the position?
Good question. I guess there will be a few d4/e5 French advance structures in this book; perhaps this can be an iterative process and the book can be recreated occasionally? If we can improve sf's blocked position play a little, then it will choose more blocked positions ... then we can improve its play a little more ... etc
Edit: or we could just get some games from somewhere else, no reason to only use fishtest? e.g. http://data.lczero.org/files/match_pgns/1/
I believe there have been some valid concerns raised in this thread, enough so that we should consider alternatives. I have now built a new book with a very different approach based on these comments. I'll again do some testing on fishtest later. The major concerns I have seen raised are:
To address this, I made a book based on the frequency of FENs in games played at lichess (restricted to Elo > 1800, TC > 60). I retained the 200k most frequent FENs out of >8M games. (see https://github.com/official-stockfish/books/pull/9)
This has the following advantages:
Of course, the choice of the initial database will somewhat influence the resulting FENs, but I think that's more or less secondary.
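The frequency-based construction described above can be sketched as follows. The record layout (`elo`, `tc`, `fens`) and the function name are hypothetical, and counting each position once per game is an assumption; the actual script behind the book may differ.

```python
from collections import Counter


def most_frequent_positions(games, keep, min_elo=1800, min_tc=60):
    """Return the `keep` most frequent FENs among sufficiently strong games.

    Each game is a dict with hypothetical keys:
      "elo"  - rating used for filtering (e.g. the lower of the two players)
      "tc"   - base time control in seconds
      "fens" - list of positions reached in the game
    """
    counts = Counter()
    for game in games:
        if game["elo"] > min_elo and game["tc"] > min_tc:
            counts.update(set(game["fens"]))  # count a position once per game
    return [fen for fen, _ in counts.most_common(keep)]


# Placeholder letters stand in for real FEN strings.
games = [
    {"elo": 2000, "tc": 180, "fens": ["A", "B"]},
    {"elo": 1900, "tc": 300, "fens": ["A", "C", "A"]},
    {"elo": 2100, "tc": 120, "fens": ["A", "B"]},
    {"elo": 1500, "tc": 180, "fens": ["C", "C"]},  # rejected: rating too low
    {"elo": 2500, "tc": 30, "fens": ["C"]},        # rejected: TC too short
]
print(most_frequent_positions(games, keep=2))
```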
Edit: the Elo testing yielded the following:
SF11 -> master (STC)
ELO: 11.89 +-1.6 (95%) LOS: 100.0%
Total: 60000 W: 13791 L: 11738 D: 34471
Ptnml(0-2): 763, 6016, 14647, 7553, 1021
https://tests.stockfishchess.org/tests/view/5ea7e0a953a4548a0348ecb1
SF11 -> master (LTC)
ELO: 14.61 +-1.6 (95%) LOS: 100.0%
Total: 40000 W: 7331 L: 5650 D: 27019
Ptnml(0-2): 181, 3045, 11987, 4486, 301
https://tests.stockfishchess.org/tests/view/5ea7e0d653a4548a0348ecb5
SF10 -> SF11 (STC)
ELO: 43.35 +-1.7 (95%) LOS: 100.0%
Total: 60000 W: 17566 L: 10119 D: 32315
Ptnml(0-2): 531, 4776, 13411, 9279, 2003
https://tests.stockfishchess.org/tests/view/5ea7e0c353a4548a0348ecb3
So the Elo spread is somewhat small on this book.
Does anybody have a pointer to another PGN database of high-quality games (e.g. master level, ICCF)? It would need to be > 2M games to be suitable for building a book, I would say.
Alternatively, a subset of high quality leela training games (again >2M) ?
noob_2/3moves books were selected to avoid drawish openings IIRC, but the closedpos book just turned out to have a good Elo spread without any explicit drawish checks. (I wonder why?)
Do you have any info on how many of these popularpos lines qualify as closed under the closedpos tests? Maybe we need a not-drawish test if we want to consider these popular and more open lines?
> noob_2/3moves books were selected to avoid drawish openings IIRC,
No they were not. In fact their draw ratio is rather high. Note: for the same Elo you want the highest possible draw ratio (= least amount of noise). If you want to lower the draw ratio, convert every draw into a win or loss using a coin flip.
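The point about draws and noise can be illustrated with the per-game score variance of a win/draw/loss distribution. The sketch below uses made-up probabilities and hypothetical function names: replacing each draw by a coin flip keeps the expected score (hence the Elo) unchanged but raises the variance.

```python
def score_mean_var(win: float, draw: float, loss: float):
    """Mean and variance of the per-game score (1, 0.5 or 0 points)."""
    mean = win + 0.5 * draw
    var = win * (1 - mean) ** 2 + draw * (0.5 - mean) ** 2 + loss * mean ** 2
    return mean, var


def coin_flip_draws(win: float, draw: float, loss: float):
    """Turn every draw into a win or a loss with probability 1/2 each."""
    return win + draw / 2, 0.0, loss + draw / 2


# Illustrative probabilities only, not measured data.
original = (0.30, 0.50, 0.20)
print(score_mean_var(*original))                   # same mean, lower variance
print(score_mean_var(*coin_flip_draws(*original))) # same mean, higher variance
```

With these illustrative numbers the variance roughly doubles, so about twice as many games would be needed for the same error bar.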
I ran a second test on a book, popularpos_lichess_v2.epd, which was constructed retaining games from >2200 Elo players only. The result, however, is nearly identical:
ELO: 43.41 +-1.7 (95%) LOS: 100.0%
Total: 59896 W: 16875 L: 9430 D: 33591
Ptnml(0-2): 492, 4789, 13408, 9300, 1959
https://tests.stockfishchess.org/tests/view/5eab03cb09d25e8e5058169b
the noob_3moves book was not selected specifically to avoid drawish openings, but it might be a side effect of how the database has been constructed.
My books were built from one simple rule: pick moves that are top N and not worse than a score threshold. I find it interesting that the result converges with a book built with human games.
I did a quick analysis (depth 13) of the score of the book moves, and that highlights quite some difference between the 2 classes of books:
basically, the human games, even in these 'popular positions' have a much broader range of scores, i.e. essentially won or lost. This improves only very little with Elo of the players. I think the main problem is that these human games are mostly very short TC (>60s, but typically 180s). So, if anybody has a clean database of long TC games between good players...
> the human games, even in these 'popular positions' have a much broader range of scores, i.e. essentially won or lost
Yes the RMS bias is around 90. See https://tests.stockfishchess.org/tests/stats/5eab03cb09d25e8e5058169b . Comparable to the 8moves book which is also derived from human openings IIRC.
Sorry I misremembered. The RMS bias of the 8moves book is around 60. It was the 2moves book that had an RMS bias around 90 (showing that biased openings are not necessarily bad).
By comparison the RMS bias of the noob_3moves book is around 30.
1 node Leela is around 2500 elo (on big SV nets - dense knowledge). How about trying a book based on eval divergence to SF search ply x? It should be very rich in SF blindspots.
So, I tried a 3rd book based on 'popular positions', namely ranking them with their frequency on lichess times the frequency on fishtest (LTCs games based on noob books only). The result is human style positions that stockfish would play as well. I think it is a nice book, with mostly openings found in master level games as well, no blunders, lots of nice opening lines etc (@MJZ1977 you might want to check). It has book move scores very similar to those of noob books:
yet, the Elo spread remains low, in fact very similar to the previous two versions:
ELO: 43.77 +-1.8 (95%) LOS: 100.0%
Total: 59940 W: 16150 L: 8638 D: 35152
Ptnml(0-2): 590, 4919, 12901, 9509, 2051
https://tests.stockfishchess.org/tests/view/5ead5bab6ffeed51f6e3257e
So, good Elo spread seems to be a different property..
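The ranking used for this third book (frequency on lichess times frequency on fishtest) might look like the sketch below. The function name is hypothetical, and normalising each count by its database total is an assumption; it does not change the ordering.

```python
def rank_by_joint_frequency(lichess_counts, fishtest_counts, keep):
    """Rank positions by the product of their relative frequencies.

    Both arguments map a FEN to the number of times it was seen; positions
    missing from either source get a product of zero and are left out.
    """
    total_a = sum(lichess_counts.values())
    total_b = sum(fishtest_counts.values())
    common = lichess_counts.keys() & fishtest_counts.keys()
    scores = {
        fen: (lichess_counts[fen] / total_a) * (fishtest_counts[fen] / total_b)
        for fen in common
    }
    return sorted(scores, key=scores.get, reverse=True)[:keep]


# Placeholder letters stand in for real FEN strings.
lichess = {"A": 10, "B": 5, "C": 1}
fishtest = {"A": 1, "B": 10, "D": 7}
print(rank_by_joint_frequency(lichess, fishtest, keep=2))
```

A position has to be reasonably common in both sources to rank highly, which matches the description of human-style positions that Stockfish would also play.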
@vondele On top of that all your books seem to have lower draw ratio than noob_3moves.epd. So for SPRT they would be even less efficient than what the Elo spread suggests....
Draw ratio, RMS bias and Elo spread seem to be independent properties of a book which seem to be impossible to predict. Very strange.
Imo Elo spread is the most important and closely linked to the sustained (thus mainly positional) complexity of a position, providing room for outplay. Initial complexity with tendency to suddenly resolve to any outcome might reduce draw-rate but with high randomness.
@NKONSTANTAKIS how does that sustained complexity assumption match the great Elo spread of the noob_3moves book?
Edit: many more complex positions in the popular pos v3 book.
I have a few theories, in order of likelihood: 1) My leaf scores are more accurate, and I fiddled with back propagation of scores based on weighted averaging of the top N moves within a certain range. 2) No selection bias for which move to play (human games have a strong preference coming from opening theory). 3) I cut lines in which both sides played bad moves but the position still ended up balanced.
@vondele I would not say great, but good elo spread. noob_4moves has a bit higher and drawkiller the highest so far counted. I think this is mainly due to the max eval filtering of 1-sided position tendencies, which favor the weaker engine. If the presented problem is simple enough for a 3200 elo to solve, how to differentiate to a 3400 elo? Basically @noobpwnftw 1. , even more powerful.
Another filter that would help is one for equal but dry positions, which are low on randomness but also low on sensitivity. In those, usually one side has to blunder for the game not to be a draw. This kind of position might be suitable for lower TC, and loses value as quality rises. But how to define "dry"? From a chess perspective it is hard to apply to a database, but I think a pretty accurate signal would be the % of draws in the results, especially when it rises disproportionately with TC.
The closed book should have lower % of this kind, having higher elo spread at LTC.
I explain the highest spread of drawkiller to the artificial asymmetric re-arrangement of pieces + the existence of all pawns. Pawns increase the long-term strategic complexity, as irreversible decisions.
I have 2 ideas as propositions for elo-spread testing:
Drawkiller might be a bit dangerous, as too specifically lopsided, but 960 from the get go should help a wide and unbiased understanding of chess. When the openings are too similar it creates overfitting, while 960 might create underfitting, with lowered correlation to normal chess. An interesting experiment nonetheless.
If I'm to make drawkiller-alike books, I might take a set of positions and for each position, play a dozen master vs master games and remove the ones that have high draw rate, until I get a set of positions that satisfies certain criteria. Whether such books are beneficial however is another question.
The real question is how to come up with the initial set of positions that are truly neutral and more meaningful than random play or FRC, not how to introduce whatever specific selection bias you'd like on top of that.
Neutral is good but it also has to be meaningful...dry is bad. Just lowering draw rate would probably worsen elo spread, but "drawkiller" (a misleading name) improves it by aiming for strategic complexity.
I think for drawishness there should be a threshold: 50% draw might be too little, but 90% too high. The 1st signal will be strong and noisy, the 2nd accurate but weak. Your method seems to have taken good care of the one-sided part of extremes, how about targeting the other by cutting out like the top 10% of draw dominance?
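Cutting out the top 10% of positions by draw dominance could be done as in this sketch. It assumes per-position game results are available as (wins, losses, draws) tuples; the function name and the example data are hypothetical.

```python
def drop_most_drawish(results, fraction=0.10):
    """Return the positions that remain after removing the most drawish ones.

    results maps a FEN to a (wins, losses, draws) tuple; the `fraction` of
    positions with the highest draw ratio is removed.
    """
    ranked = sorted(results, key=lambda fen: results[fen][2] / sum(results[fen]))
    n_drop = int(len(ranked) * fraction)
    return ranked[: len(ranked) - n_drop]


# Ten made-up positions with draw ratios from 0% to 90%.
results = {f"pos{i}": (50 - 5 * i, 50 - 5 * i, 10 * i) for i in range(10)}
print(drop_most_drawish(results))
```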
@noobpwnftw in principle we have most of the data. There are millions of LTC games for the noob_3moves book, so in principle ~1000 games per position in the book. Not truly master vs master, but close. So, one could start such an analysis. Not sure that one needs to avoid a high draw rate (see comment by @vdbergh above). I guess one needs a high draw rate, and a large difference in 1-0 / 0-1 outcome.
The engine has to know how to draw those positions, this is also a part of their performance. I think the misconception is that people want to remove such seemingly trivial positions to increase elo resolution, this doesn't work. Those positions are the most common ones and they are only a few moves in, a lot of things can happen that contributes to the final outcome.
OTOH, I think a good filter for whether a position is "complex" or "closed" is time to depth(aka. pv stability). Would be interesting to see how that results.
Something trivial at high depth might indeed not be trivial at low depth. So results vary with TC. But if a position is too forgiving, it's a problem. Forced lines that lead to dry simplifications can't be productive.
@noobpwnftw time to depth was a criterion for the closedpos book in some sense (go perft 5 being a small number).
I mean what if we build a book that contains only positions which the search is taking longer to reach certain depth, those positions should be quite complicated and might give good performance resolution. Raw perft does not reflect pruning behavior, I think closed positions have a lot of moves that cannot be trivially pruned.
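As a sketch of that idea, assuming the node counts needed to reach a fixed depth have already been measured per position (for example by running the engine on each one), the book could be restricted to the positions that take the most nodes. The function name, the keep fraction and the example numbers are hypothetical.

```python
def hardest_positions(nodes_to_depth, keep_fraction=0.5):
    """Keep the positions that need the most nodes to reach the target depth.

    nodes_to_depth maps a FEN to the node count measured at a fixed depth.
    """
    ranked = sorted(nodes_to_depth, key=nodes_to_depth.get, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


# Made-up measurements; placeholder names stand in for FEN strings.
measured = {"pos-a": 300000, "pos-b": 30000, "pos-c": 120000, "pos-d": 90000}
print(hardest_positions(measured))
```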
I actually have those numbers... let me check.
So, the average number of nodes needed to reach depth 13:
| book | nodes |
|---|---|
| noob_3moves | 81385 |
| closedpos | 123145 |
| popularpos | 113054 |
| popularpos_v2 | 111785 |
| popularpos_v3 | 115037 |
Weird, so the theory is right, but the result went the opposite...
Out of curiosity I checked depth 13 nodes in the 2moves_v2 book. The book is relatively small (12k positions) so I analyzed the whole book. The average is 134673 and the histogram looks like this:
Perft 5 nodes vs depth 13 nodes scatter plot looks like below. There is no correlation at all (R=0.14).
Position with max depth 13 nodes (385505): rnbqkbnr/p1pp1ppp/1p2p3/8/3P4/4P3/PPP2PPP/RNBQKBNR w KQkq -
Position with min depth 13 nodes (28154): rnbqkbnr/p1pp1ppp/4p3/1p6/5P2/2N5/PPPPP1PP/R1BQKBNR w KQkq -
All with latest SF (2 May 2020).
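The correlation coefficient quoted for the scatter plot is presumably a Pearson R. A minimal sketch of that computation, with made-up sample values since the underlying per-position data is not included here:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up values: perft 5 node counts vs. depth 13 search node counts.
perft5 = [4.1e6, 5.0e6, 4.6e6, 5.3e6, 4.8e6]
depth13 = [150000, 90000, 210000, 120000, 60000]
print(pearson_r(perft5, depth13))
```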
I have made a pull request to the official book repo with a closed positions book. https://github.com/official-stockfish/books/pull/8 this still needs some testing, but should eventually be available.
I first want to do some testing comparing this to the noob_3moves book on fishtest before we possibly start using this, so that we have a feeling for its quality. My initial impression is rather good.
There are several options we can first discuss here before I decide on this.