It looks good to me. I'm OK with changing the book.
Any plans to start using this?
Maybe this certain but minimal improvement (which has stalled) is too timid a step to address a serious issue: the undoubted fact that chess at very high Elo is forgiving, meaning that inferior move output is often still adequate to hold a draw. This makes potential improvements far less visible.
This is all logical and simple, but let's also back it up with some data:
Since SF10 a repetitive pattern has been observed: the Elo gain in regression tests of master vs SF10 is always much less than the sum of the Elo gains from the individual master + patch vs master tests. The Elo error bounds cannot explain this phenomenon because: A) they are usually exceeded by far, and B) the deviation is always in the same direction (the SFdev vs SF10 results always shrink the Elo difference). This of course has a lot to do with the different book, but also with the fact that the duller (or more one-sided) the positions provided, the less superiority gets translated into Elo.
The latest regression test, which featured the biggest Elo jump (+6.5 Elo plus bounds), had the same draw rate as the previous one! A skeptic could regard this as a statistical anomaly, but my explanation is different: the Elo superiority derives mainly from the unbalanced positions (not the one-sided ones, but those that can go either way and those that are borderline winnable/holdable), and those are a minority in our books. Most books provide positions which are "not complex enough for this level", which translates into "less likely to reveal playing strength". Hence this characteristic gap between draw rate and Elo gain, also evident in the late stages of SF9dev -> SF10.
All these points are visible at: https://github.com/glinscott/fishtest/wiki/Regression-Tests
So, what I am proposing is experimenting not with books which artificially lower the draw rate (as vdbergh has already demonstrated, a lower draw rate does not always translate into higher resolution), but with books that magnify the Elo difference, so that improvements become more visible. This would obviously save resources. A striking result by Stefan Pohl (with his book) shows that similar Elo discrimination requires only 25-30% of the games, which is impressive (details in the link).
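To make the resource argument concrete, here is a minimal back-of-the-envelope sketch (my own illustration, not Stefan's methodology): the number of games needed to resolve an edge scales with the per-game score variance divided by the squared score margin, so a book that turns a small measured edge into a larger one needs several times fewer games.

```python
import math

def elo_to_score(elo):
    """Expected score for a given Elo advantage (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def games_needed(elo_diff, draw_rate, z=1.96):
    """Very rough number of games needed to resolve `elo_diff` at ~95% confidence,
    assuming the decisive games split in a way consistent with the expected score."""
    s = elo_to_score(elo_diff)            # expected score per game
    win = s - draw_rate / 2.0             # implied win rate
    var = win + draw_rate * 0.25 - s * s  # per-game variance of the score
    margin = s - 0.5                      # score edge we need to detect
    return z * z * var / (margin * margin)

# A book that magnifies the measured edge needs far fewer games (illustrative numbers):
print(int(games_needed(elo_diff=2, draw_rate=0.80)))  # narrow edge, high draw rate
print(int(games_needed(elo_diff=6, draw_rate=0.40)))  # magnified edge, drawkiller-like draw rate
```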
A concern here could be that by feeding in more "corner-case" positions, SF's play could deteriorate in "normal" ones while getting better in the former. My answer is that this is unpredictable (it could be neutral, negative or positive for each patch), but probably minimal, and we could also check it by testing against the old book and comparing.
Another argument for this is the question: "Aren't we already degrading play from the starting position by tuning with our book?" To which the answer is: "Probably yes, but by not specializing in the starting position we gain variety, and if we were using the starting position our small improvements would usually be buried in the high draw rate." It's an obvious gain. We could take this sophistication a step further.
So the proposed book to experiment with, and possibly incorporate, is the work of Stefan Pohl: https://www.sp-cc.de/drawkiller-openings.htm
Stefan's enthusiasm, results, and potential for SF are evident.
The draw rate of those books varies between 33-40%, which is within the range I assessed as ideal in my previous post: keeping the share of each of the three possible outcomes close to 1/3 (the draw share can be higher). This draw rate might seem too low, but keep in mind it does not derive from one-sided openings (e.g. 60%-40%-0%, which would obviously be poor).
@mcostalba @snicolet and @all, I can hardly overstate the importance of this topic and its positive prospects for development, which I personally consider guaranteed. But since I can only convey so much of my own understanding, unavoidably leaving many unconvinced or less enthusiastic, I kindly request research, testing and action to solidify the justification for future changes, something that I am sadly unable to provide myself.
The observed phenomenon, where the Elo gain from regression tests of master vs the previous release is always much less than the sum of the Elo gains from the individual master + patch vs master tests, will always exist. One reason is that a single small change vs master will always show the maximum Elo impact for that change in self-play. Over time, as more patches accumulate in the difference between versions, some of them will not show the same additive Elo gain because other patches were added at the same time. Some may call it diminishing returns. Patch A tested against the current dev version shows a 3 Elo gain. Patch B tested against the current dev version shows a 3 Elo gain. But when you test the new current version, which has patches A and B, against the prior dev version, the combination may only show a 4 or 5 Elo gain. It happens all the time and there is nothing you can do about it. It will always exist.
Who can make a big book from the union of Pohl's "draw-killer" book and our "2moves_v2" book, so that we can start testing it seriously with the different strategies proposed in this thread?
OK, to keep the discussion rolling I have tried to make a hybrid book built from the following resources:
• 12092 positions from the file "2moves_v2.epd" posted by Dariusz Orzechowski in this post: https://groups.google.com/d/msg/fishcooking/cO5bF2_a6Ow/dZqe7LrfBgAJ
• 15962 positions from the file "Drawkiller_balanced_big.epd" created by Stefan Pohl and Hauke Lutz (I think), downloaded from the nice web page at https://www.sp-cc.de/drawkiller-openings.htm
The result is the archive "hybrid_book_beta.zip", which contains 28054 positions (in two formats: epd and pgn).
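For anyone who wants to reproduce or tweak such a hybrid, a minimal sketch of the merge step, assuming both sources are plain EPD files with one position per line (the output file name is just a placeholder):

```python
import random

# File names as given above; the output name is a placeholder of mine.
sources = ["2moves_v2.epd", "Drawkiller_balanced_big.epd"]

positions = []
for path in sources:
    with open(path, encoding="utf-8") as f:
        positions.extend(line.strip() for line in f if line.strip())

# Drop exact duplicates while keeping order, then shuffle for variety.
unique = list(dict.fromkeys(positions))
random.shuffle(unique)

with open("hybrid_book_beta.epd", "w", encoding="utf-8") as out:
    out.write("\n".join(unique) + "\n")

print(f"{len(unique)} positions written")
```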
@snicolet Very nice! I do hope we can start experimenting with it asap.
@snicolet I also recall seeing mentions of the pentanomial model, the code for which seems to already be in place. Is that also something we can investigate, since @vdbergh mentioned it saves around 10 to 15% of games?
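For context, the pentanomial model scores each reversed-color game pair as a single observation in {0, 0.5, 1, 1.5, 2} points, so the variance estimate accounts for the correlation between the two games of a pair. A hedged sketch of the general idea (not the actual fishtest code), comparing it with the per-game (trinomial) estimate:

```python
import math

def trinomial_se(game_scores):
    """Standard error of the mean score, treating every game as independent.
    `game_scores` holds per-game scores in {0, 0.5, 1}."""
    n = len(game_scores)
    mean = sum(game_scores) / n
    var = sum((x - mean) ** 2 for x in game_scores) / (n - 1)
    return math.sqrt(var / n)

def pentanomial_se(pair_scores):
    """Standard error of the mean per-game score, using reversed-color game pairs.
    `pair_scores` holds per-pair scores in {0, 0.5, 1, 1.5, 2}."""
    n = len(pair_scores)
    mean = sum(pair_scores) / n
    var = sum((x - mean) ** 2 for x in pair_scores) / (n - 1)
    return math.sqrt(var / n) / 2.0   # divide by 2 to express the error per game

# Toy data: biased openings where each engine wins with white, so results
# largely cancel within a pair; the per-game model overstates the noise.
pairs = [1.0, 1.0, 1.0, 1.5]
games = [1, 0, 1, 0, 1, 0, 1, 0.5]
print(trinomial_se(games), pentanomial_se(pairs))
```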
@snicolet Thanks, such a hybrid approach is sensible as a final choice, the strategy being to balance the exaggerated effect of the drawkiller positions. As a next step, we could either:
All this is very exciting!
As castling has already been performed manually in the drawkiller openings, I'm skeptical of using them as a big part of a testing book; I'd expect it to significantly skew tuning results and some patch tests. Many general chess improvements would still carry over (just as they do to SF in chess960), but optimizing towards drawkiller will, at least for a time, hurt strength in more "normal" chess.
Running 1K games or so under fishtest conditions on every opening of 2moves_v2, then filtering out those that exceed some threshold of drawn game pairs (maybe it would be better to use two different SF versions for this kind of testing? Or an unequal TC?), would probably be the best way to increase Elo sensitivity while keeping something representative of "normal chess".
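Assuming the per-opening statistics have already been collected into a simple file (the format, file names and cutoff below are all placeholders of mine), the filtering step itself could look roughly like this:

```python
import csv

# Assumed cutoff: drop openings where more than 90% of the pairs are double draws.
DRAWN_PAIR_THRESHOLD = 0.90

# Assumed input format: one line per opening, "fen;pairs_played;drawn_pairs".
kept = []
with open("opening_pair_stats.csv", newline="", encoding="utf-8") as f:
    for fen, pairs_played, drawn_pairs in csv.reader(f, delimiter=";"):
        if int(drawn_pairs) / int(pairs_played) <= DRAWN_PAIR_THRESHOLD:
            kept.append(fen)

with open("2moves_v2_filtered.epd", "w", encoding="utf-8") as out:
    out.write("\n".join(kept) + "\n")

print(f"kept {len(kept)} openings")
```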
I agree with Alayan. All castling-related patches will be biased with the drawkiller openings.
I also think that adding some very closed openings (like some variations of the French or the Stonewall) is not a bad idea. SF is relatively weak in these positions.
Alayan's point is valid, but there is also a high probability that "normal chess" is too non-critical to matter much. On the other hand, by tuning solely with 2moves_v2 one can argue that we optimize far too generically, on positions that occur much more rarely than others. Also, with logical chess many transpositions to similar systems occur, boosting the importance of some openings and diminishing that of others. With drawkiller, specializing directly in highly strategic positions might naturally teach SF the strategic aspects we are trying so hard to code.
So, as we can only speculate and there is wild potential involved, I think some tests are worth the investment and would bring valuable information. There are many interesting possibilities... for example, if we want to check how drawkiller relates to the 8moves book compared to 2moves_v2, we could do a wide LTC tuning run with drawkiller on many sensitive variables and then test it against master on 8moves.
Another topic I would like to raise is the noise injected by the randomized book. It is very significant even now, and it will probably become worse with critical openings. If we instead paired a fixed set of positions so that each appears an equal number of times, we would gain a lot of accuracy by comparing apples with apples. That probably means dropping SPRT. For example, with a 10,000-position book, by pairing games on all of them as a set, I estimate we would get rid of significant noise. We would repeat the set only as many times as needed to narrow the +- Elo range enough to make a decision. I reckon this system would be more economical, more accurate and much more flexible than SPRT: since every patch and every situation is unique, different treatment is logical. By manually repeating the set as many times as the maintainers require to take a decision, we also introduce discussion in between.
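A minimal sketch of what this whole-set procedure could look like, assuming the average score of each full pass over the fixed book is reported by the match runner (the example numbers are made up):

```python
import math

def elo_from_score(score):
    """Logistic Elo estimate from an average score strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(pass_scores, z=1.96):
    """Elo estimate with a rough confidence interval, from one average score per
    full pass over the fixed opening set (scores are assumed to come from the match runner)."""
    n = len(pass_scores)
    mean = sum(pass_scores) / n
    var = sum((s - mean) ** 2 for s in pass_scores) / (n - 1)
    half = z * math.sqrt(var / n)
    return elo_from_score(mean - half), elo_from_score(mean), elo_from_score(mean + half)

# Made-up example: three passes over the same 10,000-position set.
print(elo_interval([0.507, 0.512, 0.509]))
```

The maintainers would simply request further passes until the interval is tight enough to take a decision.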
Currently we demand the same SPRT bounds for a patch that adds a lot of complex code as for one that adds a single line, and the same simplification bounds for a patch that removes a lot of code as for one that removes a few variables. We try to make adjustments afterwards, but that is neither economical nor accurate. For example, many times after a "simplification" has passed, maintainers and consensus reject it anyway because it may be unimportant while a large number of games indicates a small Elo loss. Or, for a big patch that adds complexity, a second LTC or a VLTC or VVLTC is requested in order to get more Elo justification (or in the hope of escaping ugly code, depending on the point of view :).
This train of thought leads me to another idea: instead of using STC for filtering, use a specialized filtering book set (smaller, higher resolution, very balanced and wide). Around 5,000 positions feels ideal; running them straight at LTC costs about the same resources as 30,000 STC games, with huge benefits:
See @noobpwnftw's comment in the other thread (link below) for huge new books with 2 moves, 3 moves and 4 moves.
https://github.com/official-stockfish/Stockfish/issues/2283#issuecomment-531478866
I think the time has now come to decide where to store these new books, so that we can begin testing them on fishtest.
Any concrete proposition?
@snicolet we could start by storing the books & binaries repo in someone's personal repo (to be pragmatic: either my repo or your repo), waiting for @mcostalba @glinscott @zamar to move it in the future to https://github.com/official-stockfish
By the way, at the moment we have little (or zero) control over:
I'm interested in other POVs (e.g. @tomtor @vondele @noobpwnftw)
@ppigazzini I think you raise an important question on the governance of stockfish.
While the current structure works well for day-to-day development, and the project is a very nice example of a community driven effort, there are sometimes problems because those who created and own some of the infrastructure have taken a step back from active development, without having handed off the project to (a) competent successor(s) (Lesson #5). Maybe @snicolet can engage in a chat with @mcostalba @glinscott @zamar to hatch a long-term plan?
The main issue is that @glinscott pays for the current infrastructure, and it is obvious that he does not want to hand over the keys, because that could result in additional costs at the current provider.
Although @ppigazzini and I are currently active on maintenance, it is realistic that this could change in the future. So transferring the infrastructure to another environment owned by an individual or their employer would not really change or improve the current situation.
What is needed is some funding, sponsorship, to pay for costs at a standard cloud provider, so ownership can easily be transferred to new active maintainers.
Perhaps something like: https://libreelec.tv/category/funding/ (sponsorship by cloud infrastructure provider) is also an option?
@tomtor .. first, I really would like to explicitly acknowledge the efforts and ongoing support by e.g. @glinscott in creating/running this infrastructure.
I fully agree that a one-off transfer to another individual is not a solution, which is why I used 'governance', trying to imply that control or management of the project should probably rest more explicitly with a group of people. It is very natural that, over the course of a long-running and successful project, people want to take a step back, and that should be possible without affecting the project much. It might be that the current structure is fine, but that a few more people need to be entrusted with ownership of all components of the infrastructure.
Possibly funding/sponsoring is one issue, but I'm not sure it is the biggest hurdle, especially not for a project that has so many users/fans/supporters.
> first, I really would like to explicitly acknowledge the efforts and ongoing support by e.g. @glinscott in creating/running this infrastructure.
I agree fully! Without him we would not have any infrastructure or fishtest itself...
For an immediate solution to handling new files, my links should work and you can implement hash checking on the client side. If you want me to host extra files, let me know.
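As a hedged sketch of what that client-side check could be (the expected digest would have to be published alongside the download link; the file name and value below are placeholders):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a downloaded book file, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholder digest; the real one would be published next to the download link.
EXPECTED = "0" * 64
if sha256_of("hybrid_book_beta.zip") != EXPECTED:
    raise SystemExit("book download is corrupted or was tampered with, refusing to use it")
```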
For the rest, I guess the first step is to settle on a group of active infrastructure admins and a plan for sustainable resources, then ultimately migrate away from whatever becomes unavailable.
I'm OK with a few redeployments of workers or new site locations, and I believe that's the last thing people care about.
I think that the starting position is of utmost importance. Of course it can't be used exclusively, because of overfitting and loss of generalization.
I propose using it in a sizeable percentage of the games, say 10%.
IMO this will naturally fix two annoying issues:
As it is, the generic values obtained through 2moves_v1 steer SF towards the French Defence, which is both considered inferior and misplayed by SF.
Note that SF is weakest in closed positions, which is expected since:
A simple solution is to include a considerable (~20%) representation of closed positions.
Closing this thread, since we have moved to another default book now.
A running project to improve 2moves_v1.pgn by removing "bad" starting positions, i.e. positions with a forced drawing line available from the first move, or extremely one-sided positions. This is the first patch, which contains 37 positions to be removed; more patches may come if this first one gets approved.
Discussion: https://groups.google.com/forum/?fromgroups=#!topic/fishcooking/cO5bF2_a6Ow
Sheet: https://1drv.ms/x/s!AujF4uRoZmV9gmYIFJgGk7CGvdCt
2moves_v2a.pgn (the new pgn, after removing the first patch of "bad" positions): https://1drv.ms/u/s!AujF4uRoZmV9gmhe7wdKD-2zP9en
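For anyone who prefers scripting the removal over hand-editing, a minimal sketch, assuming an EPD version of the book and a plain list of FENs to drop (both file names are placeholders of mine; the actual patch edits the PGN directly):

```python
# Assumed inputs: the current book in EPD form and a plain list of FENs to remove.
def fen_key(line):
    """Use only the first four FEN fields (board, side, castling, en passant) as the key."""
    return " ".join(line.split()[:4])

with open("positions_to_remove.txt", encoding="utf-8") as f:
    to_remove = {fen_key(line) for line in f if line.strip()}

kept = []
with open("2moves_v2.epd", encoding="utf-8") as f:
    for line in f:
        if line.strip() and fen_key(line) not in to_remove:
            kept.append(line.rstrip("\n"))

with open("2moves_v2a.epd", "w", encoding="utf-8") as out:
    out.write("\n".join(kept) + "\n")

print(f"kept {len(kept)} positions")
```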