glinscott / leela-chess

**MOVED TO https://github.com/LeelaChessZero/leela-chess ** A chess adaptation of GCP's Leela Zero
http://lczero.org
GNU General Public License v3.0

Interaction of UCT and non-uncertain states (can it be improved?) #202

Open Tilps opened 6 years ago

Tilps commented 6 years ago

After reading some of the threads about mate detection, and the general perception that AlphaZero-style NN-based searches have a habit of not playing the shortest path to a win, I was left wondering whether for chess (somewhat unlike Go) UCT could be improved by special-casing winrates that are deterministically evaluated versus those that come from NN evals. The simplest case: there is a move which is checkmate, so the eval for the post-checkmate position is a deterministic loss. When running UCT from the parent node, the checkmate move probably has good policy and a good winrate, but UCT select is still going to spend a bunch of visits on other options, and if those options very probably lead to a win as well, it's really not clear which will end up being selected. UCT ultimately presumes (as far as I understand it) that it is sampling from a pool of unknown values, but once a checkmate is in play there is now an aspect of certainty.

So my thought was to add an additional term to UCT select to favour a certain win over an uncertain probable win, and correspondingly anything uncertain over a certain loss. For non-self-play scenarios you might set the weight of this term high, for a bit more confidence that the NN won't wander past victory and fall into a well-hidden trap. For training we could potentially have it enabled with a small weight, to encourage the NN to choose shorter paths to victory, since it would train the policy to prefer victories the UCT search can determine with certainty.
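Roughly, the idea is an extra term in the select score, something like the following standalone sketch (this is just an illustration, not the actual patch; names like `certainty_bonus_weight`, `is_certain_win` and `is_certain_loss` are placeholders I made up here):

```cpp
// Minimal, self-contained sketch of the idea: add a bonus/penalty to the usual
// PUCT score when a child's value is known exactly (terminal or otherwise
// proven) rather than estimated by the NN. All names here are hypothetical.
#include <cmath>
#include <cstddef>
#include <vector>

struct Child {
    float q;              // backed-up winrate from the parent's perspective
    float policy_prior;   // P(s,a) from the network
    int visits;           // N(s,a)
    bool is_certain_win;  // value is a proven win (e.g. the move is checkmate)
    bool is_certain_loss; // value is a proven loss
};

// Standard PUCT score plus a certainty term scaled by a tunable weight.
float SelectScore(const Child& c, int parent_visits,
                  float c_puct, float certainty_bonus_weight) {
    float u = c_puct * c.policy_prior *
              std::sqrt(static_cast<float>(parent_visits)) / (1 + c.visits);
    float certainty = 0.0f;
    if (c.is_certain_win) certainty = certainty_bonus_weight;    // prefer proven wins
    if (c.is_certain_loss) certainty = -certainty_bonus_weight;  // avoid proven losses
    return c.q + u + certainty;
}

// Pick the child with the highest score, as the select step of UCT would.
std::size_t SelectChild(const std::vector<Child>& children, int parent_visits,
                        float c_puct, float certainty_bonus_weight) {
    std::size_t best = 0;
    float best_score = -1e9f;
    for (std::size_t i = 0; i < children.size(); ++i) {
        float s = SelectScore(children[i], parent_visits, c_puct,
                              certainty_bonus_weight);
        if (s > best_score) { best_score = s; best = i; }
    }
    return best;
}
```

With `certainty_bonus_weight = 0` this reduces to the usual select rule, so the term can be tuned anywhere from "off" up to strongly preferring proven results.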

To test things out I forked and added a patch https://github.com/Tilps/leela-chess/commit/d866977538e453816d220b90fac8573177fbd028 - but my testing is showing extremely small improvements in self-play with one side having the additional changes and the other side not. Even with very weak starting networks, and with my bonus factor set to 2.0, it's probably less than a 25 Elo difference. (I tested both Gen 3 and ID 48.) I'm wondering if someone can spot some stupid mistake I've made, or justify why this patch is actually useless.

Akababa commented 6 years ago

I wonder if this is a little bit "cheating" in the alpha-zero spirit, because in a general environment maybe you don't know what the maximal reward is. But counter-intuitively, sometimes playing suboptimally in simulations can make MCTS stronger overall, perhaps because you're simulating the opponent's mistakes.

With that being said, maybe you could add uncertainty to the evaluation, like UCT does in some sense.

Tilps commented 6 years ago

It could be considered cheating already that the NN isn't asked to evaluate terminal positions. It could also be argued that UCT itself is cheating, since it brings years of human experience with game theory to the table in order to calculate how to mix policy values with visitation counts with ... I think if we're willing to accept hard-coded values for terminal positions, then adding further game-theory augmentation to UCT, to account for the fact that it is no longer purely dealing with values that come from sampling (or, in our case, NN estimates of what the outcome of sampling would be), is quite reasonable. That being said, I still don't have any evidence that it's worth doing ;)
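For concreteness, as far as I understand it the select rule is the AlphaZero-style PUCT formula, which already mixes the backed-up value, the policy prior and the visit counts through hand-chosen constants:

```latex
% AlphaZero-style PUCT selection: Q(s,a) is the backed-up value, P(s,a) the
% policy prior, N the visit counts, c_puct a hand-tuned exploration constant.
a^* = \arg\max_a \left[ Q(s,a) + c_{\mathrm{puct}} \, P(s,a) \,
      \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \right]
```

Every term here treats Q(s,a) as a noisy estimate; nothing in the formula knows that some of those values are exact game-theoretic results, which is the gap the extra term is trying to fill.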

Dorus commented 6 years ago

I see this "cheating" stuff pop up all day on this repo. Why are we so afraid of human algorithms?

Adding human inspired moves is bad because it can introduce bad habits into the NN that are hard to train away again.

However, adding MCTS-related fixes that are mathematically correct, can be applied to any domain, and are not tailor-made for specific board positions but rather for a whole set of them, is not likely to harm the NN's ability to learn.

I'm also not sure why it would be harmful not to train on terminal positions. The network will never be run in these positions (since there are no legal moves, the game is over). The network does not internally simulate new board positions, so training it on a position will not teach it how to look for that position one move earlier. Only during MCTS might the value of the later position be propagated back, and even then that value is not put directly into the network, but only trained into it as part of the entire MCTS output for the move before. Since terminal positions already have a 100% accurate value, there is nothing left for the NN to add to them.
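To illustrate the point (this is only a sketch with made-up types, not the engine's actual code): during a playout the value of a terminal leaf comes straight from the game rules, and the network is only ever asked about non-terminal positions; the exact value then reaches the network indirectly, through the search statistics it is trained on.

```cpp
// Hypothetical Board/Network stand-ins; only the control flow matters here.
enum class GameResult { kOngoing, kWin, kDraw, kLoss };

struct Board {
    GameResult result = GameResult::kOngoing;  // set by move generation elsewhere
    GameResult Result() const { return result; }
};

struct Network {
    // Stand-in for the value head; a real network returns a learned estimate.
    float Evaluate(const Board&) const { return 0.0f; }
};

// Value of a leaf, from the side-to-move's perspective, to be backed up by MCTS.
float LeafValue(const Board& board, const Network& net) {
    switch (board.Result()) {
        case GameResult::kWin:  return 1.0f;   // exact, no NN call
        case GameResult::kLoss: return -1.0f;  // exact, no NN call
        case GameResult::kDraw: return 0.0f;   // exact, no NN call
        case GameResult::kOngoing:
            return net.Evaluate(board);        // the only place the NN is used
    }
    return 0.0f;
}
```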

Where I do see value is training the NN on 2-, 3- or 4-piece endgames, simply because the patterns it learns there might also help it find mates with more pieces. However, in tournament settings you might still want to run a 2-, 3- or 4-piece (or larger) endgame tablebase just to make the engine play these positions perfectly. Doing so in self-play would indeed be human knowledge and potentially harmful, but again only very slightly, because the NN can still learn to play chess just fine with 5+ pieces.

Propagating terminal states one move back sounds interesting and is going to save the MCTS a few playouts, but nothing more, as MCTS should be able to converge to the terminal state in a few more playouts just fine. However, I wouldn't be surprised if, on a strong NN, the Elo difference turns out to be negligible.
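A sketch of what "propagating terminal state one move back" could look like (again with hypothetical types, not the real tree code): a node becomes a proven win as soon as one child is a proven loss for the opponent, and a proven loss only once every child is a proven win for the opponent.

```cpp
#include <vector>

enum class Proof { kUnknown, kProvenWin, kProvenLoss };

struct Node {
    Proof proof = Proof::kUnknown;  // from the side-to-move's perspective
    std::vector<Node*> children;    // one per legal move
};

// Update a node's proof status from its children's proof statuses.
void PropagateProof(Node* node) {
    bool all_children_proven_win = !node->children.empty();
    for (const Node* child : node->children) {
        // A child that is a proven loss for the opponent is a proven win for us.
        if (child->proof == Proof::kProvenLoss) {
            node->proof = Proof::kProvenWin;
            return;
        }
        if (child->proof != Proof::kProvenWin) all_children_proven_win = false;
    }
    // Every reply leads to a position the opponent has proven won: we are lost.
    if (all_children_proven_win) node->proof = Proof::kProvenLoss;
}
```

Once a node is proven this way, select no longer needs to spend playouts rediscovering the result, which is where the small saving would come from.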