Open mooskagh opened 6 years ago
As reported in the Discord channel, on move 138 of this game NN303 hangs a rook out of the blue, costing it the drawn game almost instantly. By the next move the eval had swung by +9.42:
[Event "DESKTOP-RV5DCNB, Blitz 1m+1s"] [Site "Rio de Janeiro, Brazil"] [Date "2018.05.18"] [Round "18"] [White "Spike 1.4"] [Black "lczero v0.10"] [Result "1-0"] [ECO "D94"] [Annotator "0.09;0.01"] [PlyCount "283"] [TimeControl "60+1"]
{Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz 3293 MHz W=22.2 plies; 4,701kN/s B=17.1 plies; 1kN/s; 45 TBAs} 1. c4 c5 2. Nf3 Nf6 3. Nc3 e6 4. e3 Nc6 5. d4 { Both last book move} d5 {0.01/18 2} 6. a3 {0.09/16 3 (cxd5)} cxd4 {0.06/18 2 (a6)} 7. exd4 {0.17/16 2} g6 {-0.02/19 2 (Be7)} 8. Bd3 {0.59/16 2 (cxd5)} Bg7 { -0.13/19 2 (dxc4)} 9. O-O {0.59/16 2} dxc4 {-0.12/19 2} 10. Bxc4 {0.42/18 12} O-O {-0.16/19 1} 11. Bg5 {0.47/18 3 (Re1)} h6 {-0.34/18 1} 12. Be3 {0.47/17 3 (Bh4)} b6 {-0.34/18 2 (Nd5)} 13. Qe2 {0.44/14 2} Bb7 {-0.27/19 2} 14. Rfd1 { 0.36/14 2 (Rad1)} Ne7 {-0.29/18 2} 15. Ba6 {0.34/14 2 (Bd3)} Bxa6 {-0.31/18 2}
Self play No 14200622. http://lczero.org/game/14200622 or with game analysis https://lichess.org/study/Smk8bomB/5TpAhSFw
Network: 1831f4884d6da86fe369d4a51fbfe6a433703fbf97b0e7122898170c1eede5e0 ID329
It's just a game from generated self-play using Google's Colab Leela_K80. Looks like the queen would move to get out of discovery....
Self play No 14215228. http://lczero.org/game/14215228 or with game analysis https://lichess.org/study/Smk8bomB/noxTVxVv
Network: 1831f4884d6da86fe369d4a51fbfe6a433703fbf97b0e7122898170c1eede5e0 ID329
It's just a game from generated self-play using Google's Colab Leela_K80. Mate in 1 is missed.
Self play No 14217918. http://lczero.org/game/14217918 or with game analysis https://lichess.org/study/Smk8bomB/RS645It6
Network: 1831f4884d6da86fe369d4a51fbfe6a433703fbf97b0e7122898170c1eede5e0 ID329
It's just a game from generated self-play using Google's Colab Leela_K80. Gave up queen for no reason.
@dstark1993 -- Self play games (for training data generation) have randomized moves. Although interesting to see some examples of this randomization, it's not really what this issue is about.
How can you tell if it was a randomized move or made by Leela "on purpose"? Is it something that can be checked?
@dstark1993 - yeah, you could check this by running the engine outside of training, being sure to provide some history, and seeing what the engine does.... However, that's a little tedious... I'm going to be trying to figure out some good ways to analyze the training data in a more automated way.
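A minimal sketch of that manual check, assuming the engine is driven over the standard UCI protocol. The helper below is a hypothetical name and not part of lc0; it only builds the two command lines to send, passing the whole game history through `position startpos moves ...` so the engine sees the same history it had in the game. The node budget is an arbitrary example value.

```python
def uci_check_commands(history_moves, nodes):
    """Build UCI commands that replay a game up to a position and search it.

    history_moves: moves in UCI notation (e.g. "e2e4"), from the start of the game.
    nodes: node budget for the search (example value, pick your own).
    """
    position = "position startpos"
    if history_moves:
        # Giving every move, not just a FEN, preserves the move history.
        position += " moves " + " ".join(history_moves)
    return [position, f"go nodes {nodes}"]


if __name__ == "__main__":
    # Hypothetical three-ply history, just to show the output shape.
    for line in uci_check_commands(["e2e4", "e7e5", "g1f3"], 800):
        print(line)
```

Sending these lines to the engine's standard input and reading until `bestmove` should show what it plays when no temperature or noise options are set.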
Good, because I don't understand much about programming; I'm generating games with Google Colab. I don't think I'll be able to use lc0 on my laptop (power and/or my skill at figuring out how to).
Id: Loss of Queen
Game: https://lichess.org/fzuBrqhf#72
Bad move: 36. Re5, Qxc2, (Stockfish eval goes from -0.2 to +7.3)
Correct move: Qc6
Configuration: lc0 cudnn - May 19 (default parameters), Windows 10 x64, Nvidia Titan V, Intel i5-7400T quad core, 32 GB RAM
Network ID: kb1-256x20-2100000.txt.bz2
Time control: 60" + 2"
Comments: Leela was ahead but blundered, losing a Queen, and then resigned.
That last one, "Loss of Queen", is worth monitoring a bit, I think. It's another multi-move blunder, with the theme again being "removal of defender". I'm hoping to see Leela gradually work these oversights out of her system.
The "Loss of Queen" removal of defender with check isn't actually too bad - at 10k nodes ID 329 Leela won't play the blunder Qxc2, and while the policy for Qxc2 is N 65.89%, the listed Best Move is the 3rd highest policy at 3.93%. Interestingly Leela actually prefers Rb8(N 2.21%) at 10k nodes, which isn't a blunder, and Stockfish says the position is even after that.
After Qxc2, the policy for the refutation Re8+ is 2.17% and is preferred at 5k nodes search. Rxg5 is also good for white, but not as good (QN v RR for white), and is N 3.78%. So there's lots of ways for Leela to get out of this.
Id: Loss of Knight
Game: https://lichess.org/pJh8VCTA#89
Bad move: 45. Rxb5 (Stockfish eval goes from -1 to -5.1)
Correct move: f3
Configuration: lc0.exe cudnn - May 22 (default parameters), Windows 10 x64, Nvidia Titan V, Intel i5-7400T quad core, 32 GB RAM
Network ID: 330
Time control: 60" + 2"
Comments: Leela blundered, lost a piece, and then resigned.
Id: Leela blunders, steps into checkmate in #14
Game: https://lichess.org/qNgs7Cwg#95
Bad move: 48. Qc6, Bg3 (Stockfish eval goes from +4 to +13)
Correct move: Rde8
Configuration: lc0.exe cudnn - May 22 (default parameters), Windows 10 x64, Nvidia Titan V, Intel i5-7400T quad core, 32 GB RAM
Network ID: 330
Time control: 60" + 2"
Comments: Leela blundered, faced checkmate in #14, and resigned.
I've been posting on the forums about some work I'm doing to try to automate blunder detection. Please see here: https://groups.google.com/forum/#!topic/lczero/8lK5ldgZUHA
So far, this does seem to work to find things like buggy (old) engines playing match games, as well as blind spots (missed mate in ones, etc)... However, I haven't fully figured out exactly how I'd like to measure things to get a reliable signal. Once I improve this, I hope to apply this to new match games as they come out to look for bugs and blunders.
I've actually spent a little time today doing something similar with the PGNs from the CCLS gauntlets. It's easier in a lot of ways because the opposing engine will spot the blunder, so I've started just scanning for situations where Leela's eval drops by a certain threshold (right now -200 centipawns) - this means she made a move and didn't see the refutation that was coming from the other engine. I'm also filtering it for where Leela's eval didn't start worse than -2.00, because if she's already losing and blunders more it's not so interesting in my opinion. A lot of that helped avoid "losing faster" endgame moves from cluttering it. It's still in its infancy and won't be worked on this weekend, but hopefully it can be converted into generating some automatic tracking positions for future IDs as another way to check progress.
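The filter described above can be sketched roughly as follows. This is an illustration, not the actual script: it assumes move comments in the `{eval/depth time}` style seen in the PGN at the top of this thread, and the function names are hypothetical. The defaults are the thresholds mentioned in the comment (a 2.00 pawn drop, and a starting eval no worse than -2.00).

```python
import re

# Matches comments such as "{-0.34/18 2}" or "{ 0.36/14 2 (Rad1)}":
# eval in pawns, a slash, the search depth, then whitespace.
# Mate scores and book-move comments are deliberately not matched.
EVAL_COMMENT = re.compile(r"\{\s*([+-]?\d+(?:\.\d+)?)/\d+\s")


def extract_evals(movetext):
    """Return every numeric eval found in the PGN move comments, in order."""
    return [float(m.group(1)) for m in EVAL_COMMENT.finditer(movetext)]


def find_eval_drops(evals, drop=2.0, floor=-2.0):
    """Return (index, before, after) for each sharp drop in one engine's evals.

    evals: one engine's own evals in pawns, one per move it played,
           from its own point of view (positive = it thinks it is better).
    drop:  minimum fall between consecutive evals (2.0 = 200 centipawns).
    floor: ignore drops that start from an eval already worse than this.
    """
    hits = []
    for i in range(1, len(evals)):
        before, after = evals[i - 1], evals[i]
        if before >= floor and before - after >= drop:
            hits.append((i, before, after))
    return hits


if __name__ == "__main__":
    # Made-up eval sequence for one engine; only the first drop is reported,
    # because the later ones start from an already lost position.
    print(find_eval_drops([0.3, 0.2, -2.5, -2.6, -6.0]))
```

Because the comments in a game alternate between the two engines, the list returned by `extract_evals` has to be split per side before it is passed to `find_eval_drops`.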
Using that method, here's the worst I could find from the last 500 match games: 311088-311587
White has a mate in one on move 32, but misses it and gets check-mated a couple moves later: http://www.lczero.org/match_game/311146
White doesn't protect against black's mate in one on move 46 (according to Lichess, white had a small advantage), but then black misses the opportunity and ends up with a draw: http://www.lczero.org/match_game/311121
White gives black a mate in two opportunity on both move 94 and 96, but black misses it both times: http://www.lczero.org/match_game/311554
Black had a forced mate in 5, but ends up with a draw: http://www.lczero.org/match_game/311240 https://lichess.org/analysis/8/8/pRpB4/2P2k2/P3p2K/4P3/5n2/6r1%20b%20-%20-%204%2047#93
White missed a simple tactic to capture black's queen on move 42, would have won but goes on to lose: http://www.lczero.org/match_game/311127 https://lichess.org/analysis/6k1/p7/7p/1p1pN3/8/8/1qB2R1K/8%20w%20-%20-%200%2042#82
Black misses trading its rook for a queen on move 38: http://www.lczero.org/match_game/311379
Id: Leela gets checkmated!
Game: https://lichess.org/KAxNVVzc#56
Bad move: 29. Qa3, (Stockfish eval goes from -0.2 to #-8)
Correct move: f3
Configuration: lc0 cudnn - May 26 (default parameters), Windows 10 x64, Nvidia Titan V, Intel i5-7400T quad core, 32 GB RAM
Network ID: 346
Time control: 60" + 2"
Comments: Leela was ahead but blundered and was checkmated
I just re-tested NN350 against MultiMove #1 position above.
lczero.exe -w weights.txt
position fen 6k1/4bpp1/2q1p2p/2p1P3/2P1N2P/2Bn1QP1/5P1K/8 b - - 0 35
go movetime 1000
info string Qd7 -> 15 (V: 33.78%) (N: 39.14%) PV: Qd7 Nd6 Nxe5 Bxe5 Bxd6 Qd3 Qc6 Qxd6
info string Nb4 -> 68 (V: 47.26%) (N: 31.72%) PV: Nb4 Kg2 Qa6 Nd2 Nc6 Qe4 Bf8
info string stm Black winrate 43.84%
That looks okay at that point, but NN350 still recommends the losing 35...Nb4 and doesn't notice the problems with it until 58kN.
She recommends the saving 35...Qd7 only after 129kN (NN316 found it in 19kN, NN311 in 29kN).
So tactically, at least in this case, things seem to have gone backwards again? Bit disappointing.
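For comparing runs like this across networks, the per-move lines can be pulled into a table. The sketch below assumes the `info string` format exactly as pasted above, and it takes the number after `->` to be the visit count; `parse_move_stats` is a hypothetical helper, not an lczero tool.

```python
import re

# One per-move stats line: move, "->", a count, then V and N percentages.
MOVE_STATS = re.compile(
    r"info string (\S+)\s+->\s+(\d+)\s+\(V:\s*([\d.]+)%\)\s+\(N:\s*([\d.]+)%\)"
)


def parse_move_stats(output):
    """Return (move, count, v_percent, n_percent) for each per-move line."""
    return [
        (m.group(1), int(m.group(2)), float(m.group(3)), float(m.group(4)))
        for m in MOVE_STATS.finditer(output)
    ]


sample = (
    "info string Qd7 -> 15 (V: 33.78%) (N: 39.14%) PV: Qd7 Nd6 Nxe5\n"
    "info string Nb4 -> 68 (V: 47.26%) (N: 31.72%) PV: Nb4 Kg2 Qa6\n"
    "info string stm Black winrate 43.84%\n"
)

if __name__ == "__main__":
    for row in parse_move_stats(sample):
        print(row)
```

The summary line (`stm Black winrate ...`) has no `->` part, so it is skipped.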
Ok, this was from a CLOP test, and it's really bad. In the position below, Bh3+ wins the queen on the spot. Even on my quad with a GTX1060, LCZero v0.10 (default settings) and id358 take 3m30s to see this!
Here is full PGN. Key move is move 28.
[Event "?"] [Site "?"] [Date "2018.05.28"] [Round "1"] [White "lc0-may22"] [Black "ice3"] [Result "0-1"] [ECO "A05"] [PlyCount "65"] [EventDate "2018.??.??"] [TimeControl "60+1"]
@ASilver... Very similar to one of the match game blunders I referenced above (although this was a little bit older)... """ White missed a simple tactic to capture black's queen on move 42, would have won but goes on to lose: http://www.lczero.org/match_game/311127 """
All of these types of issues seem to be due to Leela having very small priors when it appears that a strong piece is left under attack and undefended. In my opinion, simple 1 and 2 move tactics should set the bar for how flat the training target policies are. If training hasn't found that for many situations it might be good to put the opponent's king in check in order to capture a queen, then IMO, that's a clear signal to flatten out the policies.... For example, making the PUCT/FPU changes you offered should help. Also, it's worth noting that Chess has a much more jagged landscape than Go (and Leela-Zero seems to be the basis for a lot of thoughts regarding PUCT/FPU in Leela-chess) -- chess is full of terrain with very sharp ups and downs. It's very hard for any value head to smooth that out completely. IMO, the policy entropy needs to be increased just to give the value head a chance.
Anyway, I'm hoping that eventually the devs will agree that the training target policy entropy is too small, as well as the entropy of the policy head (which I plan to do analysis on), and that corrections will be made to flatten this out and make it less sharp.
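To make "policy entropy" concrete, here is a small sketch of how it could be measured and how flattening changes it. The temperature-style flattening is only one illustrative choice, not something the training pipeline is confirmed to use, and the prior values are made up.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits, of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def flatten(probs, temperature):
    """Raise each prior to 1/temperature and renormalise.

    temperature > 1 flattens the distribution, temperature < 1 sharpens it.
    """
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


# Made-up priors: one move dominates and the rest share what is left.
sharp = [0.66, 0.24, 0.04, 0.04, 0.02]
flat = flatten(sharp, 2.0)

if __name__ == "__main__":
    print("entropy before:", round(entropy_bits(sharp), 3))
    print("entropy after: ", round(entropy_bits(flat), 3))
```

A flatter prior gives low-policy moves (such as a quiet check that wins a queen) more early visits, at the cost of spreading the search more thinly.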
Ideally (if this doesn't already exist) a system would be made such that training game generation allows for metaparameters to be pushed/pulled from the server. This would allow for small changes that can easily be reverted if necessary.
ID367 doesn't see mate in two (I left her running for a bit, and she finally saw it at over 1 million nodes, which is a "bit" too many for a mate in two). Default lc0 settings.
It takes ID 367 over 30k nodes to see that giving a knight for no reason isn't good. (all the while giving herself +5 eval) Default lc0 settings.
3r4/1p3pkp/1qp5/6p1/Pp2Pp1P/1P3Pn1/2Q3PK/2BR4 b - - 0 29 ID 367 doesn't see the Qf2 tactic, even with 8 million nodes. Default lc0 settings.
r3k2r/pR2pp1p/6p1/8/b2bP3/8/q2BBPPP/3Q1RK1 w kq - 2 16 ID 367 doesn't see the Rxe7 tactic, even with 2 million nodes. Default lc0 settings.
Please use forms from the first message of the issue. Just posting screenshots with comments "it doesn't see anything" is not that useful.
Also, if you do some preparation work (like importing into lichess, pointing what's the correct move and what wrong move is done instead), you'll save time to (hopefully) multiple people who will look into that, and they won't have to spend time on that multiple times.
I provided FENs for the two tactical positions, so a lichess link isn't necessary. Correct moves can be seen in the screenshot, which should be looked at as it's the whole point of the post. Configuration and id are in the post.
edit: Sorry if I seem annoyed, but everything useful is in the screenshot. The position, sf8, sf9, leela evals, the continuation, nodecount and id (although it's in the post as well). It's much more intuitive for people that are familiar with chess GUIs to look at a screenshot with all the information, than to read through walls of text.
Having the screenshot for extra details is fine, but for people trying to collate the entire contents of the thread, having one standard format makes that job much easier than trying to parse the unique format of your post, for N different "you" people making posts.
Game: https://lichess.org/study/yyIYpmle (game 4)
Bad move: 81. b7??
Correct move: any bishop move that stays on the b1-h7 diagonal
Configuration: default win-lc0 on a gtx1070
Network ID: 367
Time control: 40/40, taken 68s for the move (> 500k nodes)
Comments: She realized that she blundered immediately (eval -40 next move), so she either overlooked Rxe4 completely (very unlikely given she used 500k nodes), or she thought she could promote one of the pawns after b7. However, the rook does a very good job of threatening mate and indirectly covering the promotions (if h8=Q, Ra4#; if b8=Q, Ra4+ followed by Rb4+ taking the queen). Altogether that's a 3-move tactic which she definitely should've seen in 500k nodes. I tested her with PUCT=3.0 and she still didn't see Rxe4.
Important!
When reporting positions to analyze, please use the following form. It makes it easier to see what's problematic with the position:
lc0/lczero version, operating system, and non-default parameters (number of threads, batch size, fpu reduction, etc).
(old text below)
There are many reports on forums asking about blunders, and the answers so far have been something along the lines of "it's fine, it will learn eventually, we don't know exactly why it happens".
I think at this point it makes sense to actually look into them to confirm that there are no blind spots in training. For that we need to:
--temperature=1.0 --noise)" to see how training data would look for this position.
Eventually all of this would be nice to have as a single command, but we can start manually.
For lc0, that can be done this way: --verbose-move-stats -t 1 --minibatch-size=1 --no-smart-pruning (unless you want to debug specifically with other settings).
Then run UCI interface, do command:
(PGN move to UCI notation can be converted using pgn-extract -Wuci)
Then do:
see results, add some more nodes by running:
And look how counters change.
Counters:
Help wanted: