mooskagh commented 6 years ago

Important!

When reporting positions to analyze, please use the following form. It makes it easier to see what's problematic with the position:

Id: Optional unique ID. Come up with something. :) A number or word, just to make it easier to refer to this position in further comments.
Game: Preferably link to lichess.org (use lichess.org/paste), or at least PGN text.
Bad move: Bad move with number, and optionally stockfish eval for that move
Correct move: Good move to play, optionally with stockfish eval
Screenshot: Optional screenshot of the position, pasted right into the message (not as a link!). Helps to grasp the problem without following links.
Configuration: Configuration used, including lc0/lczero version, operating system, and non-default parameters (number of threads, batch size, fpu reduction, etc).
Network ID: Network ID, very important
Time control: Time control used and, if known, how much time or how many nodes were spent thinking on this move
Comments: Any comments that you may have, e.g. a free-form explanation of what's happening in the position.

(old text below)

There are many reports on forums asking about blunders, and the answers so far have been something along the lines of "it's fine, it will learn eventually, we don't know exactly why it happens".

I think at this point it makes sense to actually look into them to confirm that there are no blind spots in training. For that we need to:

Open the position in the engine and check the counters
Try several times to evaluate that move with the training configuration "800 playouts, with Dirichlet noise and temperature (--temperature=1.0 --noise)" to see what the training data would look like for this position.
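
As a rough illustration of what that training configuration does to the data, here is a minimal Python sketch. It assumes the AlphaZero-style scheme: Dirichlet noise is mixed into the root priors, and the played move is sampled in proportion to visits^(1/temperature). The function names, the example numbers, and the epsilon=0.25 / alpha=0.3 values are assumptions for illustration, not taken from lc0's source.

```python
import random

def add_dirichlet_noise(priors, epsilon=0.25, alpha=0.3, rng=random):
    """Mix Dirichlet(alpha) noise into root priors (epsilon/alpha are assumed values)."""
    # A Dirichlet sample is a set of Gamma(alpha, 1) draws normalised to sum to 1.
    draws = [rng.gammavariate(alpha, 1.0) for _ in priors]
    total = sum(draws)
    noise = [d / total for d in draws]
    return [(1 - epsilon) * p + epsilon * n for p, n in zip(priors, noise)]

def visit_distribution(visits, temperature=1.0):
    """Turn visit counts into move probabilities: pi_i proportional to N_i ** (1 / T)."""
    weights = [n ** (1.0 / temperature) for n in visits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical policy output; the last move stands for a tactic with a tiny prior.
priors = [0.60, 0.30, 0.0989, 0.0011]
noisy = add_dirichlet_noise(priors, rng=random.Random(0))

# Hypothetical visit counts after 800 playouts.
pi = visit_distribution([500, 250, 49, 1], temperature=1.0)
print(noisy)
print(pi)
```

Running the noise step several times with different seeds shows how often a low-prior move gets enough of a boost to be explored at all, which is the point of repeating the evaluation.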

Eventually all of this would be nice to have as a single command, but we can start manually.

For lc0, that can be done this way: --verbose-move-stats -t 1 --minibatch-size=1 --no-smart-pruning (unless you want to debug specifically with other settings).

Then run the UCI interface and issue the command:

position startpos moves e2e4 ....

(PGN move to UCI notation can be converted using pgn-extract -Wuci)

Then do:

go nodes 10

See the results, then add some more nodes by running:

go nodes 20
go nodes 100
go nodes 800
go nodes 5000
go nodes 10000
and so on

And look at how the counters change.
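
The manual session above can be scripted. This is a small sketch that only builds the UCI text to paste or pipe into the engine; the helper name and the example move list are made up for illustration.

```python
def uci_session(moves, node_budgets=(10, 20, 100, 800, 5000, 10000)):
    """Return the UCI commands for one position, searched at growing node budgets."""
    commands = ["position startpos moves " + " ".join(moves)]
    commands += [f"go nodes {n}" for n in node_budgets]
    return commands

# Hypothetical opening moves in UCI notation (e.g. produced by pgn-extract -Wuci).
for line in uci_session(["e2e4", "e7e5", "g1f3"]):
    print(line)
```

When driving the engine from a script, each "go" should be sent only after the previous search has printed its "bestmove" line.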

Counters:

e2e4 N: 329 (+ 4) (V: -12.34%) (P:38.12%) (Q: -0.2325) (U: 0.2394) (Q+U: 0.0069)
 ^      ^    ^      ^           ^          ^            ^           ^
 |      |    |      |           |          |            |           Q+U, see below
 |      |    |      |           |          |           U from PUCT formula,
 |      |    |      |           |          |           see below.
 |      |    |      |           |         Average value of V in a subtree
 |      |    |      |          Probability of this move, from NN, but if Dirichlet
 |      |    |      |          noise is on, it's also added here, 0%..100%
 |      |    |     Expected outcome for this position, directly from NN, -100%..100%
 |      |   How many visits are processed by other threads when this is printed.
 |     Number of visits. The move with maximum visits is chosen for play.
Move

* U = P * Cpuct * sqrt(sum of N of all moves) / (N + 1)
  CPuct is a search parameter, can be changed with a command line flag.
* The move with largest Q+U will be visited next
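
To make the selection rule concrete, here is a minimal Python sketch of the formula above. The counters for d2d4 and a2a3, the Q assigned to the unvisited move, and the cpuct value are all hypothetical; only the formula itself comes from the text.

```python
import math

def puct_u(prior, visits, total_visits, cpuct):
    """U = P * Cpuct * sqrt(sum of N of all moves) / (N + 1)."""
    return prior * cpuct * math.sqrt(total_visits) / (visits + 1)

def next_move(stats, cpuct):
    """stats maps move -> (N, P, Q); returns the move with the largest Q + U."""
    total = sum(n for n, _, _ in stats.values())
    return max(
        stats,
        key=lambda m: stats[m][2] + puct_u(stats[m][1], stats[m][0], total, cpuct),
    )

# Hypothetical counters: a heavily visited move, a less visited one,
# and an unvisited move with a tiny prior (its Q is an assumed placeholder).
stats = {
    "e2e4": (329, 0.3812, -0.2325),
    "d2d4": (40, 0.2500, -0.2600),
    "a2a3": (0, 0.0011, -0.3000),
}
print(next_move(stats, cpuct=1.2))  # cpuct=1.2 is an assumption, not lc0's default
```

Note how the move with a 0.11% prior gets a small U even with zero visits, so it can wait a long time before Q+U makes it the next move to visit.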

Help wanted:

Feel free to post positions that you think need analyzing (don't forget to also mention network Id used, and also all other settings are nice to know)
Feel free to analyze what other people posted

Tilps commented 6 years ago

https://lichess.org/e07JvP6g - I started analyzing this - the position after the blunder has a 0.11% policy for a move which is checkmate. Takes 20k visits to get its first look and then it obviously gets every visit. I haven't tested how that varies with noise applied.

Ghotrix commented 6 years ago

isn't position startpos fen ... more convenient for this case?

dubslow commented 6 years ago

Here's an easy one ply discovered attack tactic missed by Leela after 2K5 nodes. Position: https://lichess.org/mbWjiT93#105 Twitch recording of the thinking time/engine output: https://clips.twitch.tv/GenerousSmellyEggnogPunchTrees And as the lichess analysis says, this was "merely" the cherry on top of the multiple-mistakes cake. How to swing 15 points' eval in just 3 moves!

Further analysis requested please. How many playouts until Leela even once searches the tactic?

Edit: Tilps' position is also a discovery bug, I think Leela's policy assumes that the rook can just capture the queen, which is of course prevented by the pin = discovered attack

hsntgm commented 6 years ago

@mooskagh thanks for diagram.

If I am wrong, please correct me.

Leela's brain gets its power from memorized games and positional samples she collected in self-play, and we call it visits. As I see it, her visits come from the weights instead of alpha-beta pruning. If there is a tactical opportunity in the position but Leela visits another move more, she chooses that move.

Basic tactical positions occur suddenly in a game, and the hardest part is to teach her this. Is it necessary to play billions of games in order to learn the tactical motifs that occur during the game?

Or can you add a simple tactical search algorithm that triggers on every move and works independently from visits for a while? After she finds a tactical move with the tactical search algorithm (one that looks for sudden jumps to +1, +2, etc.) and enters this move's tree, she can add this sample to her brain too. This way she would learn to play tactically in a short time and tune herself automatically.

chara1ampos commented 6 years ago

I am stating the obvious, but I think that brute force engines like Stockfish and Houdini have the advantage that their evaluation is cheap, and they can search very deep, thus having great tactics.

Leela's evaluation is very expensive, and thus she cannot search deep enough to avoid blunders. I sense that if one could speed up her evaluation, so she could search deeper, her blunders would be greatly reduced.

On an Nvidia Titan V, where Leela cudnn can evaluate 8000 nodes per second, she did not seem to blunder, and even won several games against Stockfish, Komodo and Houdini: https://groups.google.com/forum/#!topic/lczero/YFMOPQ-J-q4

I recall that alpha zero evaluated around 100000 nodes per second on the deep mind supercomputer, which greatly improves its tactics. This begs the question: what nps did alpha zero use during its training process? I suspect the number of nps can greatly affect the quality of the games during Leela's training. If the cudnn version of Leela can be used for training, the quality and speed of training will likely be increased drastically.

mooskagh commented 6 years ago

I've added a form for problematic positions submission into the original message. Sorry for bureaucracy, but that makes it much easier to see the problem.

Ishinoshita commented 6 years ago

@chara1ampos : The DM paper says "During training, each MCTS used 800 simulations.", which is a bit ambiguous and may read as new playouts added to the tree or as visits for selected node. Thus nps is irrelevant (but for the total training time). 800 'simulations' is anyway far below 10K's of simulations you mention for match games. So, yes, AZC training may have included blunders as well, at least in early stages (like where we stand now).

Why-Sensei commented 6 years ago

ID: 0001
Game: https://lichess.org/FI3y76b0
Bad move: 22. Ng4 (and 21. Rxg8+ wasn't optimal either IMHO)
Correct move: 22. Nf1
Screenshot:
Configuration: Game: LCZero cuDNN 20180508 ID 263 vs Stockfish 9 18050811 Lvl 20 GUI: Arena 3.5.1 Settings: Threads/Cores: 4, Hash Memory: 64MB, Table Memory: 0MB, Ponder: Off, Own Book: Off, Book/Position Learning: Off, Book: none, EGTB: none System: Win 10 Professional, NVIDIA GTX 1070, Ryzen 7 1800X, 64 GB RAM
Network ID: 263
Time control: 40/12 (adjusted to CCRL)
Comments: Game was streamed on May 8th at https://www.twitch.tv/y_sensei

hsntgm commented 6 years ago

@chara1ampos why anybody ask this question maybe Alpha zero just a auto tuned stockfish derivative with neural network.The traditional chess engines elo depends tuning parameters in their code.Maybe they just do that in neural network.

Stockfish 1.01: Elo 2754 in 2008. Stockfish 9: Elo 3444 in 2018.

Look at Stockfish's development history: it gained only 700 Elo in ten years, with enormous CPU time and genius C programmers who tuned parameters step by step. Now we wait for Leela to gain 500 Elo with self-play. Who knows, maybe the road map is totally wrong.

Why do I think that? Because someone says Leela draws with Stockfish. OK, very good news, but how can you explain these blunders and tactical weaknesses in a 3000 Elo program? Leela's skeleton formed after 10 million games; there is no going back, and this is a big paradox for the project.

Ishinoshita commented 6 years ago

"maybe Alpha zero just a auto tuned stockfish derivative with neural network" I'm afraid this is fully wrong, in at least:

different tree search methods (MCTS vs alphabeta)
different heuristics/evaluation functions (NNs vs handcrafted features)
different learning approach (zero chess-specific human knowledge apart from the rules, vs handcrafted features based on human knowledge, even if a computer-assisted approach may be used for choosing the final blend of parameters).

But you're fully right regarding the huge number of self-play games needed. LCZ learns very slowly. The Alpha-like approach is brute force at the learning stage (tens of millions of games needed), then 'smarter' (at least more human-like, one would say) at the playing stage (far fewer positions explored during tree search compared to SF or other alpha-beta engines, to achieve the same strength). The training pipeline squeezes very little knowledge from each self-played game, so millions are needed. Might sound disappointing indeed. Far from the human way...

Why-Sensei commented 6 years ago

ID: 0002
Game: https://lichess.org/efi0R82j
Bad move: 21. Qxg7
Correct move: 21. Re3
Screenshot:
Configuration: Game: LCZero 0.9 ID 271 vs Stockfish 9 18050909 Lvl 20 GUI: Arena 3.5.1 Settings: Threads/Cores: 4, Hash Memory: 64MB, Table Memory: 0MB, Ponder: Off, Own Book: Off, Book/Position Learning: Off, Book: none, EGTB: none System: Win 10 Professional, NVIDIA GTX 1070, Ryzen 7 1800X, 64 GB RAM
Network ID: 271
Time control: 40/4 (adjusted to CCRL)
Comments: Game was streamed on May 10th at https://www.twitch.tv/y_sensei

mooskagh commented 6 years ago

Thanks for submitting the bug reports, they were very useful.

All the blunders so far can be explained by #576. The fix is there in client v0.10, but it will take multiple network generations to recover the network.

So for a few days (until ~300000-500000 games are generated by the v0.10 client and the network is trained on that), don't submit any other positions, as they are likely caused by the same bug.

After that new blunder reports are very welcome!

mooskagh commented 6 years ago

For now, it would be most interesting to see examples of blunders that appeared recently, e.g. if LCZero played the correct move with network id270 and now blunders. That way we'd have some examples of what exactly it unlearns and could look into the training data.

LC0fan commented 6 years ago

ID: ID288CCLSGame65
Game: https://lichess.org/8mCbbkwl#240
Bad move: 121. Rc7
Correct move: Many other moves
Screenshot 1:
Screenshot 2 shows analysis by ID288 in Arena on my machine:
Screenshot 3 shows analysis by ID94 in Arena on my machine (Rc7 not listed):
Configuration: CCLS Gauntlet
Network ID: 288
Time control: 1 min + 1 sec (increment)
Comment: Game was streamed on May 14th 2018.

LC0fan commented 6 years ago

ID: ID288CCLSGame53
Game: https://lichess.org/0YJMfRI6#260
Bad move: 131. Ra6
Correct move: 131. Ba1 (by Stockfish 9 on Lichess)
Screenshot 1:
Screenshot 2 shows analysis by ID288 in Arena on my machine:
Screenshot 3 shows analysis by ID94 in Arena on my machine (Ra6 not listed):
Configuration: CCLS Gauntlet
Network ID: 288
Time control: 1 min + 1 sec (increment)
Comment: Game was streamed on May 14th 2018.

mooskagh commented 6 years ago

Thanks for posting, we are looking into those positions. Evaluation of this position is improved a lot in id291, which confirms the main explanation that we have now (value head overfitting).

LC0fan commented 6 years ago

ID: ID288CCLSGame72
Game: https://lichess.org/CVYOwXSK
Bad evaluation: Drew by 3-fold repetition with an evaluation of +15.98
Screenshot:
Configuration: CCLS Gauntlet
Network ID: 288
Time control: 1 min + 1 sec (increment)
Comment: Game was streamed on May 14th 2018.

LC0fan commented 6 years ago

There are lots more examples, but I will stop here then :)

LC0fan commented 6 years ago

I couldn't resist one more...

ID: ID280CCLSGame7
Game: https://lichess.org/rWWqu4tx#98
Rh7 would end the game immediately by 3-fold repetition, but Leela played the losing move Kh3 instead:
Stockfish 9 gives Rh7 as the only move
Configuration: CCLS Gauntlet
Network ID: 280
Time control: 1 min + 1 sec (increment)
Comment: Does Leela handle 3-fold repetition correctly?

apleasantillusion commented 6 years ago

Interestingly, on the Rc7?? Kxc7 and Ra6 Kxa6?? blunders above, I can reproduce both of them with ID288 on CPU with game history.

With just FEN, while it doesn't play both blunders, the killing responses to both blunders are given very low probability from policy, so it's just dumb luck that the engine doesn't play the blunder.

The really interesting part is that with the FEN modified so the 50-move rule halfmove counter is set to 0, it immediately sees both killing moves with very high policy outputs.

This is also true of this recent match game: http://lczero.org/match_game/268131

With game history or FEN, 292 plays 132. Rc7??, giving the obvious capture response very, very low policy output.

With FEN altered so 50-move rule halfmove counter is set to 0, it immediately sees the capture with 99% probability from policy.

Maybe these examples are just lucky, but it seems high values for the 50-move rule halfmove counter correlate with very strange blunders.

nelsongribeiro commented 6 years ago

http://lczero.org/match_game/268155

ID 292 blunders again against ID 233 near the 50-move rule coming up...

trophymursky commented 6 years ago

interesting bit based off of apleasantillusion's comment (tho I'm using 292).

the fen for the interesting position is "2r5/R7/8/8/5k2/8/2K5/8 w - - 85 121" where the policy net ID292 has Rc7 (wrongfully) at 99.91%.

Specifically, if you set it to 60 half moves (instead of 85), the policy for Rc7 is at 0.07%. At 65 half moves it's at 0.2%, at 66 it's at 0.71%, at 67 it's at 1.23%, at 68 it's at 6.53% (no longer considered the worst move), and at 69 it's at 89.47%.

I have no idea why the inflection point would be anywhere near where it is, but it's definitely interesting and points towards a training bug corrupting the policy net.

so-much-meta commented 6 years ago

FYI... Regarding the a7c7 rook blunder above, I think this might be explained (partially) by https://github.com/glinscott/leela-chess/issues/607 EDIT: I guess this can be disregarded since someone confirmed that Arena and most GUIs do always send moves... Regardless, leaving this here because it is interesting to see the difference in policies with and without history.

Network 288..

With history:
position fen 7r/8/R7/3k4/8/8/2K5/8 w - - 77 117 moves a6a5 d5e6 a5a6 e6f5 a6a5 f5f4 a5a7 h8c8
go nodes 1000 (==> This chooses Kb3)
info string Kb2 -> 0 (V: 59.51%) (N: 0.29%) PV: Kb2
info string Kd1 -> 0 (V: 59.51%) (N: 0.90%) PV: Kd1
info string Kb1 -> 2 (V: 52.08%) (N: 1.86%) PV: Kb1 Ke3 Ra3+
info string Kd2 -> 5 (V: 59.96%) (N: 2.27%) PV: Kd2 Kf5 Rb7 Kf4
info string Kd3 -> 11 (V: 60.53%) (N: 9.70%) PV: Kd3 Ke5 Re7+ Kd6 Re8 Kd5
info string Rc7 -> 381 (V: 67.65%) (N: 80.40%) PV: Rc7 Rb8 Rb7 Kf5 Rxb8 Ke6 Kd3 Kd5 Rb5+ Kc6 Kc4
info string Kb3 -> 491 (V: 83.56%) (N: 4.58%) PV: Kb3 Ke5 Rc7 Kd6 Rxc8 Kd7 Rc5 Kd6 Kc4 Ke6
info string stm White winrate 76.24%

Without history:
position fen 2r5/R7/8/8/5k2/8/2K5/8 w - - 85 121
go nodes 1000 (==> This chooses Rc7)
info string Kd1 -> 0 (V: 61.53%) (N: 0.00%) PV: Kd1
info string Kd2 -> 0 (V: 61.53%) (N: 0.00%) PV: Kd2
info string Kb2 -> 0 (V: 61.53%) (N: 0.00%) PV: Kb2
info string Kb3 -> 0 (V: 61.53%) (N: 0.00%) PV: Kb3
info string Kd3 -> 0 (V: 61.53%) (N: 0.00%) PV: Kd3
info string Kb1 -> 0 (V: 61.53%) (N: 0.01%) PV: Kb1
info string Rc7 -> 500 (V: 70.63%) (N: 99.98%) PV: Rc7 Rb8 Rb7 Kf5 Rxb8 Ke6 Kd3 Kd5 Rb5+ Kc6 Kc4

so-much-meta commented 6 years ago

As to the a7c7 blunder above, I think the history's only part of the problem... The other part of the issue is that the All Ones plane (last input plane) bug really messed up policies.

Good input data was being trained on a bad policy. Consider the effect of the negative log loss/cross entropy in these examples (non-buggy network with low outputs getting trained on a buggy high output).

Here's output from network ID 280. Notice that the a7c7 move only has high probability when the all ones input plane was buggy. Essentially, I think it was bad data like this that kept messing things up.

History + AllOnesBug Policy ('a7c7', 0.8687417), ('c2d3', 0.046122313), ('c2b3', 0.034792475), ('c2d2', 0.03021726), ('c2d1', 0.0111367665), ('c2b1', 0.006555821), ('c2b2', 0.0024336604), Value: 0.5331184417009354

History + NoBug ('c2d3', 0.47858498), ('c2b3', 0.13757008), ('c2d2', 0.13545689), ('c2d1', 0.08749167), ('c2b1', 0.08396132), ('c2b2', 0.07649834), ('a7c7', 0.000436759), Value: 0.5014338248874992

NoHistory + AllOnesBug Policy: ('a7c7', 0.99920577), ('c2d2', 0.00019510729), ('c2b3', 0.00015975242), ('c2d3', 0.00015850786), ('c2b1', 0.0001421545), ('c2d1', 7.9948644e-05), ('c2b2', 5.882576e-05)]), Value: 0.5555554553866386

NoHistory+NoBug ('c2d3', 0.34282845), ('c2b3', 0.22524531), ('c2d2', 0.14119184), ('c2b2', 0.09196934), ('c2d1', 0.09108826), ('c2b1', 0.08420463), ('a7c7', 0.023472117), Value: 0.49658756237477064
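
To put a number on the cross-entropy effect mentioned above, here is a small Python sketch using the ID 280 "History" dumps. Treating the buggy output directly as the training target is a simplification made here for illustration (real targets come from search visit counts), and the helper name is made up.

```python
import math

# Policy outputs for network ID 280, copied from the dumps above.
buggy_target = {"a7c7": 0.8687417, "c2d3": 0.046122313, "c2b3": 0.034792475,
                "c2d2": 0.03021726, "c2d1": 0.0111367665, "c2b1": 0.006555821,
                "c2b2": 0.0024336604}                     # History + AllOnesBug
clean_output = {"c2d3": 0.47858498, "c2b3": 0.13757008, "c2d2": 0.13545689,
                "c2d1": 0.08749167, "c2b1": 0.08396132, "c2b2": 0.07649834,
                "a7c7": 0.000436759}                      # History + NoBug

def cross_entropy_terms(target, predicted):
    """Per-move contributions -t * ln(p) to the cross-entropy loss."""
    return {move: -t * math.log(predicted[move]) for move, t in target.items()}

terms = cross_entropy_terms(buggy_target, clean_output)
loss = sum(terms.values())
print(f"loss = {loss:.2f}, a7c7 share = {terms['a7c7'] / loss:.0%}")
```

Almost all of the loss comes from the a7c7 term, so under this simplification the gradient would push hard toward raising the policy for exactly the blunder move.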

Now look how all of that changed by network 286, below - now the input with missing history is starting to show the bad policy:

History+AllOnesBug ('a7c7', 0.88481957), ('c2d3', 0.043222357), ('c2d2', 0.030274319), ('c2b3', 0.017787572), ('c2b1', 0.011131173), ('c2b2', 0.011077223), ('c2d1', 0.0016878309), 0.8049132525920868)

History+NoBug (OrderedDict([('c2d3', 0.35683072), ('c2b3', 0.17884524), ('c2d2', 0.15325584), ('c2b2', 0.1069537), ('c2d1', 0.10222348), ('c2b1', 0.10148263), ('a7c7', 0.00040832962)]), 0.5084156421944499)

NoHistory+AllOnesBug ('a7c7', 0.9984926), ('c2d3', 0.00064814655), ('c2b1', 0.00030561475), ('c2d2', 0.00022950297), ('c2b3', 0.00016663132), ('c2d1', 8.821991e-05), ('c2b2', 6.930062e-05)]), 0.8271850347518921)

NoHistory+NoBug ('c2b3', 0.35689142), ('a7c7', 0.227083), ('c2d2', 0.1410887), ('c2d3', 0.10505199), ('c2b1', 0.078001626), ('c2d1', 0.0670605), ('c2b2', 0.024822742)]), 0.49565275525674224)

By the time it got to network 288, the policy was really bad in this particular spot:

History+AllOnesBug ('a7c7', 0.81777406), ('c2b1', 0.0735284), ('c2d3', 0.045673266), ('c2d2', 0.044812158), ('c2d1', 0.011020878), ('c2b3', 0.0059179077), ('c2b2', 0.0012732706), 0.9999993741512299)

History+NoBug ('a7c7', 0.8040016), ('c2d3', 0.0970014), ('c2b3', 0.04580218), ('c2d2', 0.022658937), ('c2b1', 0.018647738), ('c2d1', 0.008990083), ('c2b2', 0.0028980032), 0.5951071679592133

NoHistory+AllOnesBug ('c2b1', 0.30733383), ('a7c7', 0.25477663), ('c2d2', 0.19509505), ('c2d3', 0.17735933), ('c2d1', 0.037348717), ('c2b3', 0.02388807), ('c2b2', 0.004198352), 0.9999998211860657

NoHistory+NoBug ('a7c7', 0.99980253), ('c2b1', 6.103614e-05), ('c2d3', 4.706335e-05), ('c2b3', 3.6989695e-05), ('c2b2', 2.2621784e-05), ('c2d2', 1.6375083e-05), ('c2d1', 1.3423687e-05), 0.6152948960661888

Now, at network 294, this is the current situation (ignoring buggy input plane, as it's no longer relevant):

History+NoBug ('c2d3', 0.32457772), ('c2b1', 0.19262017), ('c2d1', 0.15003791), ('c2b3', 0.12282815), ('c2d2', 0.10260171), ('c2b2', 0.08874603), ('a7c7', 0.018588383), 0.46542854234576225)

NoHistory+NoBug ('a7c7', 0.99916804), ('c2b1', 0.00017883514), ('c2d1', 0.00016860983), ('c2d3', 0.00016126267), ('c2b2', 0.00012590773), ('c2d2', 0.00010842814), ('c2b3', 8.8898094e-05)]), 0.43435238301754)

gyathaar commented 6 years ago

Does it still blunder in those positions if you use --fpu_reduction=0.01 (instead of default 0.1) ?

apleasantillusion commented 6 years ago

In the game nelsongribeiro posted, the same pattern holds true (tested with 292).

With history, she plays 124.Ke7 with a very high probability from policy (84.89%), and the response Qxd5 just taking the hanging queen is given only a 2.93% from policy.

Without history at the root, just FEN, she again plays Ke7 with high probability from policy (95.83%), and the Qxd5 response taking the hanging queen is given only 2.33% from policy.

With the FEN modified in only one way, setting 50-move rule counter to 0, Ke7's policy drops to 37.34%, and Qxd5 after Ke7 jumps to 95.07%

Now, from a purely objective standpoint in this particular position, none of this matters so much, since the position is losing to begin with, although forcing black to find the winning idea in the king and pawn ending is a much stronger way of playing than just hanging the queen.

Also, independently of that, the fact that taking a hanging queen is only ~2% from policy when the 50-move rule counter is high is a bit disturbing and is in line with the other examples I cited above.

In general, the variation in probability for Qxd5 based on the 50-move rule counter is quite odd.

In that exact position with black to move (6q1/4K2k/6p1/3Q1p1p/7P/6P1/8/8 b - - 0 0), here are probabilities for Qxd5 with different values of 50-move rule counter:

0: 68.26% 1: 76.71% 5: 89.40% 10: 91.63% 20: 92.48% 30: 94.28% 40: 89.57% 50: 77.83% 60: 83.95% 70: 52.39% 80: 66.84% 90: 11.43% 99: 1.06%
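
A sweep like this is easy to script. The sketch below only rewrites the halfmove-clock field of the FEN and prints the UCI "position" commands; the helper name is made up, and querying the network for the policy is left out.

```python
def with_halfmove_clock(fen, halfmoves):
    """Return the FEN with its fifth field (the halfmove clock) replaced."""
    fields = fen.split()
    if len(fields) != 6:
        raise ValueError("expected a 6-field FEN")
    fields[4] = str(halfmoves)
    return " ".join(fields)

fen = "6q1/4K2k/6p1/3Q1p1p/7P/6P1/8/8 b - - 0 0"   # position quoted above
for clock in (0, 1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 99):
    print("position fen " + with_halfmove_clock(fen, clock))
```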

nelsongribeiro commented 6 years ago

The really bad move is the move made just before that position: The FEN position is (4K3/6qk/3Q2p1/5p1p/7P/6P1/8/8 w - - 94 123).

It's a draw at this point. On move 123, ID 292 played 123. Qd5 instead of 123. Qe6.

The last time that a pawn was moved was at 75...f5, which makes this the 48th move after that.
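
The count can be checked against the halfmove clock of 94 in the FEN above. A tiny Python sketch, assuming no capture or pawn move happened between 75...f5 and move 123 (the helper name is made up):

```python
def ply_index(move_number, black):
    """Number of plies played once this move has been made (White's move 1 is ply 1)."""
    return 2 * move_number - (0 if black else 1)

last_pawn_move = ply_index(75, black=True)      # 75...f5
before_white_123 = ply_index(122, black=True)   # position with White to play move 123
halfmove_clock = before_white_123 - last_pawn_move

# Halfmove clock, and which full move since the pawn move White is about to play.
print(halfmove_clock, halfmove_clock // 2 + 1)
```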

EDIT: best move was wrong before..

apleasantillusion commented 6 years ago

Just to add to this, the Rxf3+ that's being tracked in the sheet at https://docs.google.com/spreadsheets/d/1884-iHTzR73AgFm19YYymg2yKnwhtHdaGyHLVnffLag/edit#gid=0 shows the same behavior.

With net 297, probability with various 50-move counter rule values:

0: 0.07% 10: 1.88% 25: 10.23% 50: 1.11% 90: 1.72% 99: 6.97%

That's some heavy variation just from changing the 50-move rule counter.

Also, the pattern is different with this one. In all the others, probabilities were worst at the very high counts, a bit better at very low counts, and best at counts around 30. Here that last trend maintains, but the other is more muddled.

ASilver commented 6 years ago

I don't know if it is a consequence of the bug, or the new PUCT values which inhibit tactics, but the latest versions (I am watching 303 right now) have some appallingly weak ideas of king safety and closed positions. I am playing a match against id223 at 1m+1s and it is more than a tactical weakness issue, it is one of completely wrong evaluations, which 223 did not have, that is leading it to happily let its king be attacked until it is too late to save itself. I also saw more than one case where it thought a dead drawn blocked game, with no entry or pieces, was +2, while 223 thought it about equal. The result was that 303 preferred to sacrifice a pawn or two to not allow a draw, and then lost quickly thanks to the material it gave away.

eval303vs223-01

id303 is white, and id223 is black. Both are playing with v10 (different folders) with default settings.

LC0fan commented 6 years ago

@ASilver I have watched many games from many of the CCLS gauntlets, and my overall view (from the perspective of spectator) is that her style changed markedly following the release of v0.8. In particular, she started showing:
Type 1: unstable or poor evaluations, and inflexible play.
Type 2: "buggy-looking" play in closed positions in which both engines are shuffling, in repeated positions (especially 3-fold), and in positions where the 50-move rule is key.
In my view, Type 1 has "followed the trajectory of the value head", whereas Type 2 has seemed to persist even as "the value head has partially recovered".

so-much-meta commented 6 years ago

Regarding the rook blunder above: position fen 7r/8/R7/3k4/8/8/2K5/8 w - - 77 117 moves a6a5 d5e6 a5a6 e6f5 a6a5 f5f4 a5a7 h8c8 (or just "2r5/R7/8/8/5k2/8/2K5/8 w - - 85 121")

I did some more analysis. I think some here might find it useful. For one, I didn't look closely at rule50 when I looked at this before. Also, I found evidence that filling in history planes with fake data (just copy the current position, or the oldest one if some history is available) is probably best in case no history is available. Check out the attached PDFs with graphs.

What I did was I iterated the halfmove clock (rule50) from 0 through 91, created the FEN position at each halfmove clock, played each of the 8 moves, and determined the network value and policy under the following conditions: (1) Normal History (2) Fake History [All history planes are just copied from the first] (3) No History [All history planes are set to 0]

I did this with both net 288 and net 303. So that's 12 graphs. (halfmove clock is horizontal axis).

I also added the move a7c7 (the blunder move - where Leela does not want to take the free rook, so blunders again in the PV), and did the exact same process.
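
The three history conditions can be sketched as follows. This is a hypothetical helper, not the script used for the graphs: it assumes 8 input slots (the current position plus seven of history), positions listed newest first, and None standing in for an all-zero plane set.

```python
def history_slots(positions, mode, length=8):
    """Return `length` input slots built from `positions` (newest first).

    mode "normal": real history, missing slots left empty (None)
    mode "fake":   missing slots filled with a copy of the oldest known position
    mode "none":   only the current position, every history slot empty
    """
    known = positions[:1] if mode == "none" else positions[:length]
    filler = known[-1] if mode == "fake" else None
    return known + [filler] * (length - len(known))

# Placeholder strings stand in for encoded board positions.
print(history_slots(["current"], "fake"))
print(history_slots(["current", "prev"], "normal"))
print(history_slots(["current", "prev"], "none"))
```

With only the current position available, "fake" reproduces condition (2) above and "none" reproduces condition (3).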

Some findings:

The blunder is strongly correlated with the halfmove clock.
The situation is clearly getting better going from 288 up to 303
Fake history (as opposed to all zeroes) would be very advantageous here, if no history is available.
(Graphs not attached for this) -- if there is just one move of history, then things usually seem to be evaluated very closely to as if there were the full seven. I.e., Leela only seems to care about the current position and the last position, not two positions ago.

rook_blunder_Id288_Id303.pdf rook_blunder_Id288_Id303_a7c7.pdf

mooskagh commented 6 years ago

The value graphs for network 303 are a bit misleading. It surprised me to have such a drop near ply 80, but then I realized that scale of Y axis is 0.45..0.49.

Would be more demonstrative if all graphs were with 0.0..1.0 scale.

Also, what is the correct move there? (there were several rook blunders in this thread, so it's not immediately clear which one you refer to).

peterjacobi commented 6 years ago

Good examples for the longstanding problem with discovered attacks. Furunkel just posted in #game-analysis, but should be preserved here:

Re-looking at the recent ID300 gauntlet, discoveries are still a huge problem for Leela. In the first 50 games, Leela fell 6 times for a discovered attack. This was not always a game loser but still. I collected those 6 games if people want to use these for testing: https://pastebin.com/mffqHESV The critical moves in those were: game 1: 28.Rb7, game 2:23.Nd5, game 3: 35.Qe4 (ouch), game 4: 29.Rh6 (ouch), game 5: 30.Nc5, game 6: 15..Qd4 (ouch). The 3 I labeled "ouch" are particularly bad.

Also attached: leela300-discoveries.txt

rwbc commented 6 years ago

Interesting blind spot (with mate) again in current ID 304 (from matchplay):

http://lczero.org/match_game/280156

hsntgm commented 6 years ago

The latest IDs have higher Elo, so yesterday I tested ID 303 in a 60-minute game against Stockfish and watched the whole game.

I was hoping for a good result, but it seems she still suffers deeply from this unknown bug. I am not a strong player, just 1700, and even I can see that her position evaluation and understanding of simple tactics are terribly broken compared to older networks. So these Elo gains are not real.

It seems it will be very difficult to get rid of the effects of the past bug. She has a broken heart. If you believe that you found the bug in the algorithm and corrected it, maybe you just have to start the whole training process again. If the problem is the training data and you corrected it, I think you have to restart anyway, because she will need twice as much training as necessary to get rid of this effect.

At least simulate it from the beginning with all the training data (starting from the best clean network we know, ID253) and check the situation. Something is still going wrong, so please listen to people.

LC0fan commented 6 years ago

I am just posting here an example that was posted on Discord #technical three hours ago. It is yet another awful 50-move-rule related blunder (the 50-move count starts at move 48 in this example). The blunder is made on move 96 of this game: https://lichess.org/Y1zkwwuv#190

steve3140 commented 6 years ago

Haven't posted here before, but noticed this blunder in a match game. Candidate ID 306 was playing Black in this game:

http://lczero.org/match_game/281647

and blundered into a discovered check with 60. ... Ke6 and dropped a bishop, so the game petered out to a draw.

so-much-meta commented 6 years ago

Alexander - good point about the graph scale. I recognized that, of course, but just kinda had a late night "data is data" attitude... This is the KRkr position where Leela tries to block a rook check with an unprotected rook.

Correct first move is mostly anything but a7c7, the move where Leela tries to give away its rook in net 288 (and only after ~75 halfmoves). But c2d3 is best, moving the king toward the center. (both nets want to do that when halfmoves aren't too big).

Correct second move is c8c7 (taking the unprotected rook). Both ID288 and ID303 think that's a bad idea when halfmoves is big.

In all cases, not having history amplifies the error.

haleysa commented 6 years ago

I have been tracking 3 of the earliest 1-move tactics misses in this thread for the last few nets, and also looked into them to see what's going on under the covers a bit.

Position 1 - The missed pin https://lichess.org/e07JvP6g#43 Leela is totally blind to the mate in 1 threat, presumably because she thinks the mating square is covered by the d2 rook, but the rook is pinned. The problem is not the policy head on the blunder move; actually, Leela's policy prefers the #1 SF move of Kb1. However, the mating move Qxc2# has such a low policy on any of the many blunder moves Leela might make, it takes 30000+ nodes down each "blunder move" to realize they are bad and start preferring a new move. More search won't solve this easily; it would take a lot of nodes. At 100k Leela still blunders, just with a different move Nb6 or Qxb4. The policy for Qxc2# has gone from 0.13% at ID297 to 0.21% at ID308. It's slow improvement, if at all.

Position 2 - Discovered attack with check https://lichess.org/mbWjiT93#105 Leela is blind to the threat that was created of Rg6+ and a discovered attack on the black queen. In ID297 at 10k nodes Leela prefers a different losing move, b4; but by ID302 at 10k the preferred move is the SF best move of Qf3 - close enough that it's searchable at a reasonable time control. The policy for the refutation move Rg6+ has gone from 0.43% at ID297 to 0.39% at ID308. It's gotten as high as 0.70% in ID302; ID308 was a downswing.

Position 3 - Missed mate threat after capture https://lichess.org/FI3y76b0#43 Leela needs to stop mate threat of Qh2. Both Ng4 and Nf1 guard the h2 square, but Ng4 allows the knight to be captured by the rook. I believe Leela puts this as a low probability because the knight is guarded, so it would be dropping the exchange - except of course it's dropping mate elsewhere instead. Here the policy for Ng4 is very high - about 90% - and the policy for the Rxg4 refutation is very low, 0.09 for ID297 up to 0.14 for ID308. 10k nodes is not enough to prevent the blunder or find the refutation, but 100k nodes is.

Also in this thread, something I found interesting but haven't tracked closely is a 2-move tactic where Leela drops the queen: https://lichess.org/efi0R82j#43 Leela takes the pawn with Qxg7 and her queen gets trapped with Rg6. It's actually a miss of both position 2's type and position 3's type. There are two ways her queen could try to get out. The first is Qxh7, which is met with a discovered attack on the queen with check, Rg2+. The second is Rxe5, threatening the black queen, which is met with Rxg7; if white responds by taking with Rxe6, then black has Rxe1# (this is the path taken in the game; she avoids mate but drops the queen). Being 2 ply deeper in search, and with two "good" moves that have a hard-to-spot refutation, this takes a huge amount of search to avoid (more than 100k nodes, and I suspect a lot more). But this should go away once both of the other issues are learned.

rwbc commented 6 years ago

@haleysa good summary!

steve3140 commented 6 years ago

If this would benefit from an ID, let's call it MultiMove#1.

I'm also interested in multi-move tactics and noticed one when I was playing some matches between NN226 and NN300 last night. It involves a couple of themes where I've noticed Leela making some bad mistakes recently, which in this example are:

theme: removal of defender (capturing a piece which defended the queen, leaving the queen hanging); combined with
theme: discovered check (knight gives king check, revealing discovered attack against the now-hanging queen)

Here's the moves of the game: LC Zero 226 GPU (2800) - LC Zero 300 GPU (2800) [B51] 1.e4 c5 2.Nf3 d6 3.Bb5+ Nd7 4.Bxd7+ Bxd7 5.d3 e6 6.0-0 Ne7 7.d4 Ng6 8.dxc5 dxc5 9.Nc3 Be7 10.Qe2 Qc7 11.Rd1 0-0 12.Nb5 Bxb5 13.Qxb5 Rfd8 14.Bd2 a6 15.Qe2 b5 16.g3 Qb7 17.h4 Nf8 18.b3 b4 19.c4 a5 20.a4 bxa3 21.Rxa3 Ra7 22.Rda1 Rda8 23.Kg2 Nd7 24.Bc3 Nb8 25.Rxa5 Rxa5 26.Rxa5 Rxa5 27.Bxa5 Qxb3 28.e5 Qb7 29.Bc3 Na6 30.Kh2 Nb4 31.Nd2 h6 32.Ne4 Qd7 33.Qe3 Qc6 34.Qf4 Nd3 35.Qf3 Nb4 36.Bxb4 cxb4 37.Nf6+ gxf6 38.Qxc6 fxe5 39.Qe8+ Bf8 40.c5 b3 41.c6 b2 42.c7 b1Q 43.c8Q Qb4 44.Qcd7 Qe7 1-0

Time control was Blitz 2m+1sec/move. All engine settings were defaults.

The tactics are at move 35 where NN226 playing White lays a 'trap' with 35.Qf3 and NN300 as Black stumbles into it with 35...Nb4 (best/saving move = 35...Qd7).

FEN for position just before Nb4 is: 6k1/4bpp1/2q1p2p/2p1P3/2P1N2P/2Bn1QP1/5P1K/8 b - - 0 35

Checked some other versions of Leela in Chessbase Reader as an analysis engine: NN253 finds 35...Nb4 problematic after 6kN and eventually recommends 35...Qd7 after 16kN.
NN311 recommends 35...Nb4 and doesn't see any problems until 29kN have been searched (on my hardware that's 4 mins). NN311 eventually locates 35...Qd7 after 59kN (7mins50sec). After 35...Nb4 is played (as per game), NN311 recommends 36.Qg4 for White and persists with this recommendation until I gave up the search after about 10 mins. After 36.Bxb4 is played (as per game), NN311 then sees the tactic within 1kN and moves the queen away.

haleysa commented 6 years ago

Update: Position 1 - the missed pin, ID 316 has improvement. The refutation is up to N 0.34 and is now found in under 10k nodes; the initial blunder, and related blunders from the same position, are now avoided at 100k nodes and Leela plays the SF preferred move of Kb1 to avoid mate and retain advantage. Comparing this to ID 227, for those considering a rollback - the refutation was as low as 0.05 and even at 100k nodes was not visited even once. A lot of progress made on that over the last ~90 nets.

The other two positions I am tracking are pretty similar in place to where they were at ID 227.

steve3140 commented 6 years ago

Tested NN316 against MultiMove#1 above. FEN = 6k1/4bpp1/2q1p2p/2p1P3/2P1N2P/2Bn1QP1/5P1K/8 b - - 0 35

NN316 still recommends the losing 35...Nb4 initially. She finds the saving 35...Qd7 after 19kN. Improvement (NN311 took 29kN).

After 35...Nb4 as in the game: NN316 still recommends 36.Qg4 and doesn't see the winning 36.Bxb4 until 87kN (takes 9:48 on my hardware). Improvement (NN311 took longer and I never even completed that test, having given up at 10mins previously).

After 36.Bxb4 as in the game: NN316 still recommends 36...cxb4 and sees the check and discovered attack (37.Nf6+ and 38.Qxc6) then finds 36...Qc7 after 3kN.

Changed PUCT from 0.60 to 0.85 and repeated the tests:

35...Qd7 was found after 10kN (was 19kN)
36.Bxb4 was found after 23kN (was 87kN)
36...Qc7 was found only after 22kN (was 3kN) ... to me, this was surprisingly worse.

I have a question about FPU Reduction. On my setup in Chessbase Reader, FPU Reduction = 0.10. Is this the "correct" current setting, or is this a hangover from an earlier setup on my environment?

steve3140 commented 6 years ago

So I finally spent a few minutes learning about the CLI after seeing someone else's post, so I've worked out how to get more detailed info. Here's a comparison of that MultiMove#1 position from a few different versions:

lczero.exe -w position fen 6k1/4bpp1/2q1p2p/2p1P3/2P1N2P/2Bn1QP1/5P1K/8 b - - 0 35 go movetime 1000

NN253
info string Qd7 -> 15 (V: 34.55%) (N: 27.76%) PV: Qd7 Nd6 Bxd6 Qxd3 Qb7 exd6
info string Nb4 -> 81 (V: 45.14%) (N: 40.43%) PV: Nb4 Qg4 Nd3 Qe2 Nb4 Qg4 Nd3 Qe2
info string stm Black winrate 42.47%

NN292
info string Qd7 -> 22 (V: 33.95%) (N: 33.45%) PV: Qd7 Nd6 Nxe5 Bxe5 Bxd6 Qd3 Qc6 Qxd6 Qf3
info string Nb4 -> 67 (V: 45.35%) (N: 20.13%) PV: Nb4 Qg4 Nd3 Nf6+ Kh8 Nh5 Bf8
info string stm Black winrate 40.59%

NN311
info string Qd7 -> 19 (V: 38.07%) (N: 40.24%) PV: Qd7 Nd6 Nxe5 Bxe5 Bxd6 Qd3 Qc6
info string Nb4 -> 41 (V: 49.85%) (N: 24.53%) PV: Nb4 h5 Qa6 Nd6 Bxd6 exd6 Qxd6 Qa8+
info string stm Black winrate 44.86%

NN316
info string Qd7 -> 17 (V: 32.87%) (N: 41.50%) PV: Qd7 Nd6 Nxe5 Bxe5 Bxd6 Qd3 Qc6 Qxd6
info string Nb4 -> 77 (V: 48.53%) (N: 21.11%) PV: Nb4 Qe3 Nc2 Qd3 Nb4 Qe2 Qa6 Nd6
info string stm Black winrate 44.36%

At a longer 1 minute (movetime 60000) search, NN316 still recommends the losing move Nb4 after 1600 nodes:
info string Qd7 -> 69 (V: 21.95%) (N: 41.50%) PV: Qd7 Nd6 Nxe5 Bxe5 Bxd6 Qd3 Qc6 Qxd6 Qf3 Qxc5 f6 Bd4 e5 Be3 Kh7
info string Nb4 -> 4014 (V: 44.92%) (N: 21.11%) PV: Nb4 Qg4 Kh8 h5 Bf8 Nd6 Kg8 Qe4 Qd7 Qa8 Nd3 f4 Nf2
info string stm Black winrate 44.21%

With a 10 minute (movetime 600000) search, NN316 finds the best move Qd7 after 24000 nodes:
info string Nb4 -> 9609 (V: 35.46%) (N: 21.11%) PV: Nb4 Qg4 Kf8 Nd6 Kg8 Ne4 Kf8 Nd6 Kg8 Ne4
info string Qd7 -> 46166 (V: 35.93%) (N: 41.50%) PV: Qd7 Nd6 Nxe5 Bxe5 f6 Qa8+ Bf8 f4 h5 Kh3 fxe5 fxe5 g6 Kg2 Kg7 Qa6 Kh7 Qb6 Bg7 Qxc5 Qa4 Qd4 Qc2+ Kf3 Qb3+ Kf2 Qc2+ Ke1 Qc1+ Ke2 Qc2+ Kf1 Qc1+ Kg2 Qc2+ Kf3 Qc1 Ne4 Qf1+ Ke3 Qg1+
info string stm Black winrate 35.77%
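For comparing nets side by side, the per-move lines quoted above can be pulled into a small table. This is only a sketch: the regular expression is written against the line shape shown in this comment (`info string <move> -> <visits> (V: ..%) (N: ..%) PV: ...`), and the names `MOVE_STAT` and `parse_move_stats` are made up for illustration. Other engine versions may print a different format.

```python
import re

# Matches one per-move line as quoted in this thread, e.g.
#   info string Qd7 -> 15 (V: 34.55%) (N: 27.76%) PV: Qd7 Nd6 ...
# The "stm ... winrate" summary line has no "->" and is skipped.
MOVE_STAT = re.compile(
    r"info string (?P<move>\S+) -> (?P<visits>\d+) "
    r"\(V: (?P<value>[\d.]+)%\) \(N: (?P<prior>[\d.]+)%\)"
)

def parse_move_stats(text):
    """Return {move: (visits, value_pct, prior_pct)} for each move line in text."""
    stats = {}
    for m in MOVE_STAT.finditer(text):
        stats[m["move"]] = (int(m["visits"]), float(m["value"]), float(m["prior"]))
    return stats

# NN253 output copied from the comment above.
sample = (
    "info string Qd7 -> 15 (V: 34.55%) (N: 27.76%) PV: Qd7 Nd6 Bxd6 Qxd3 Qb7 exd6 "
    "info string Nb4 -> 81 (V: 45.14%) (N: 40.43%) PV: Nb4 Qg4 Nd3 Qe2 Nb4 Qg4 Nd3 Qe2 "
    "info string stm Black winrate 42.47%"
)

stats = parse_move_stats(sample)
# Print moves from most to least visited.
for move, (visits, value, prior) in sorted(stats.items(), key=lambda kv: -kv[1][0]):
    print(f"{move:6s} visits={visits:6d} V={value:5.2f}% N={prior:5.2f}%")
```

Because it uses `finditer`, the same function works whether the output is one line per move or pasted as a single run of text.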

Question (from one who is ignorant): Given a 41.50% vs 21.11% differential, why was Qd7 visited so little to start with?

mooskagh commented 6 years ago

@steve3140 It means that after trying that move, the value head's output was so low that the search decided the prior was wrong.

It's also known to give especially bad results when the no-capture counter is around 30-40, and when there is no history in the FEN position. It seems that in your tests you had both.
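To make the "value overrides prior" point concrete, here is a rough sketch using the AlphaZero-style PUCT score, Q + c * P * sqrt(N_parent) / (1 + N_child). Several things here are assumptions rather than facts about the engine: that lczero's selection rule has exactly this shape, that the V column can be used directly as Q on a 0..1 scale, and that the parent visit count is just the sum of the two moves shown. The numbers are the NN316 one-second figures from the comment above, and 0.6 is the PUCT value mentioned earlier in the thread.

```python
import math

def puct_score(q, prior, child_visits, parent_visits, c_puct=0.6):
    """AlphaZero-style selection score: value term plus prior-weighted exploration bonus."""
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u

# NN316, movetime 1000: Qd7 -> 17 (V 32.87%, N 41.50%), Nb4 -> 77 (V 48.53%, N 21.11%).
parent = 17 + 77  # assumption: ignore the few visits spent on other moves
qd7 = puct_score(0.3287, 0.4150, 17, parent)
nb4 = puct_score(0.4853, 0.2111, 77, parent)

print(f"Qd7 score {qd7:.3f}")
print(f"Nb4 score {nb4:.3f}")
```

Under these assumptions Nb4 still scores higher than Qd7 even though Qd7 has roughly twice the prior and far fewer visits: the gap in the value term is larger than the exploration bonus can make up, so the search keeps spending visits on Nb4.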

haleysa commented 6 years ago

For reference on the 3 positions I'm tracking - the no-capture counter is under 10 on all three positions, and I'm testing by putting the full game history in with "position startpos moves ..." and then "go nodes ..." As of ID327, there's been no significant improvement; the refutation move is still under 1% N in all three cases, and generally under 0.5%.
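For anyone wanting to repeat this kind of test, the two UCI commands described above can be generated from a move list. A minimal sketch: `position startpos moves ...`, `position fen ...` and `go nodes ...` are standard UCI commands, but the helper name `uci_commands` and its parameters are hypothetical, and sending the strings to an engine is left out.

```python
def uci_commands(moves=None, fen=None, nodes=10000):
    """Build the 'position' and 'go' commands for a fixed-node search.

    moves: list of moves in UCI long algebraic form, e.g. ["e2e4", "c7c5"].
    fen:   optional FEN string; if omitted, the game starts from startpos,
           which keeps the full move history available to the engine.
    """
    position = f"position fen {fen}" if fen is not None else "position startpos"
    if moves:
        position += " moves " + " ".join(moves)
    return [position, f"go nodes {nodes}"]

# Full-history form, as used for the three tracked positions.
for line in uci_commands(moves=["e2e4", "c7c5", "g1f3"], nodes=10000):
    print(line)

# FEN-only form (no history), as in the MultiMove#1 tests.
for line in uci_commands(fen="6k1/4bpp1/2q1p2p/2p1P3/2P1N2P/2Bn1QP1/5P1K/8 b - - 0 35"):
    print(line)
```

The startpos form is the one that preserves history; the FEN form reproduces the no-history condition discussed above.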

ASilver commented 6 years ago

What if the PUCT and FPU values were changed? Have you tested?


haleysa commented 6 years ago

No, but I don't see much reason to test. Of course modifying PUCT and FPU will help; finding the refutation moves is a matter of searching for a single move that immediately changes the valuation significantly. The first refutation position is mate in 1; as soon as the search starts touching the position, it elevates it to best move pretty much immediately. But the policy head is so low in some nets that it's basically tied for last choice of moves. Tuning PUCT/FPU to make sure we touch all moves in under 1000 nodes is not a usable solution; the policy head needs to know these refutations are good moves.
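A toy illustration of why a very low prior delays the first visit so much. This is not lc0's search: it is a two-move node with an AlphaZero-style PUCT rule, fixed made-up values (0.5 for the popular move, a first-play value of 0.4 for the unvisited one) and PUCT 0.6. Only the two priors, 0.13% and 0.21%, are taken from the figures quoted for Qxc2# earlier in the thread. The function name `sims_until_first_visit` is hypothetical.

```python
import math

def sims_until_first_visit(priors, values, target, c_puct=0.6, fpu=0.4,
                           max_sims=200_000):
    """Count simulations until `target` is selected for the first time.

    values[i] is a fixed stand-in for the averaged value of child i once visited;
    unvisited children use the first-play value `fpu` instead.
    """
    visits = [0] * len(priors)
    for sim in range(1, max_sims + 1):
        sqrt_total = math.sqrt(max(sum(visits), 1))

        def score(i):
            q = values[i] if visits[i] else fpu
            return q + c_puct * priors[i] * sqrt_total / (1 + visits[i])

        best = max(range(len(priors)), key=score)
        if best == target:
            return sim
        visits[best] += 1
    return None  # never selected within max_sims

def toy(refutation_prior):
    # Child 0: the "obvious" reply holding almost all the prior; child 1: the refutation.
    priors = [1.0 - refutation_prior, refutation_prior]
    return sims_until_first_visit(priors, values=[0.5, 1.0], target=1)

low = toy(0.0013)   # prior like the 0.13% quoted at ID297
high = toy(0.0021)  # prior like the 0.21% quoted at ID308
print("first visit at prior 0.13%:", low)
print("first visit at prior 0.21%:", high)
```

In this toy the lower prior needs noticeably more simulations before the move is touched at all, which is the effect described above: once the move is visited its value speaks for itself, but the prior decides how long that takes.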

so-much-meta commented 6 years ago

Reverting PUCT/FPU might help in training (to get the policy head to see the correct move so that it's correctly trained in the first place).


ASilver commented 6 years ago

@haleysa You don't need to touch all moves, but there is a difference between touching all and not enough. What if widening it led it to find more correct moves, and thus learn from them, and thus... improve?

glinscott / leela-chess

Analyze blunders #558
