
Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper.
GNU General Public License v3.0

"Exact-Win Strategy for Overcoming AlphaZero" #2276

Open alreadydone opened 5 years ago

alreadydone commented 5 years ago

https://www.researchgate.net/publication/331216459_Exact-Win_Strategy_for_Overcoming_AlphaZero

tl;dr Modified MCTS algorithm has a 61% winrate over the original (in 100 games) using the same network.

I think the abstract should mention that they didn't train any Go network, let alone independently. The search algorithm is only one part of the AZ approach; merely improving it can't be considered overcoming AZ. From the parts I've read (mostly quoted below, emphasis mine) it doesn't sound like the same approach as Lc0's certainty propagation.

Abstract. The Monte-Carlo Tree Search used in AlphaZero may easily miss a critical move because it samples the search space and focuses on the most promising moves. In addition, the Monte-Carlo Tree Search may sample a move many times even after that move has been explored to a determined game-theoretical value. In this paper, we propose an Exact-win-MCTS that makes use of sub-tree information (WIN, LOSS, DRAW, and UNKNOWN) to prune unneeded moves and increase the chances of discovering the critical ones. Our method improves and generalizes some previous MCTS variations as well as the AlphaZero approach. The experiments show that our Exact-win-MCTS substantially improves the strength of Tic-Tac-Toe, Connect4, and especially Go programs. Finally, our Exact-win Zero defeats Leela Zero, which is a replication of AlphaZero and currently one of the best open-source Go programs, with a significant 61% win rate. Therefore, we are pleased to announce that our Exact-win-MCTS has overcome the AlphaZero approach without using extra training time, playing time, or computing resources. As far as we know, this is the first practical idea with concrete experiments to beat the AlphaZero approach.

Smartly, they called on many volunteers' computers to continuously train the weights of their neural network. Because of resource and time limitations, we also make use of weights already trained by Leela Zero: the newest Leela Zero weights as of 2018-08-04, with hash code b0841a68b4beae9314e47757b44dbe16c5d421dfadcba9ddfe25373e8c27b262.

LZ # 161 | 2018-08-04 18:50 | b0841a68 | 20x256

In this work, we apply Exact-win-MCTS to Leela Zero, naming the result Exact-win Zero, and compare it with the original Leela Zero. We manually design an interesting Go endgame (see Figure 4) to observe the behavior of Leela Zero and Exact-win Zero respectively. Figure 5 shows that Leela Zero converges to the correct win-rate estimation for both Black and White starting from the 27th move; that is, Leela Zero sees the result of this endgame only after the 27th move. Our Exact-win Zero, however, sees the result of this endgame after the 11th move, as shown in Figure 6. The horizon effect thus hits Leela Zero much earlier than it hits Exact-win Zero; in other words, our Exact-win Zero can search much deeper than Leela Zero.

Finally, we let our Exact-win Zero play 100 games against Leela Zero, starting from an empty board and with each program playing Black for 50 games. In general, two programs of the same strength (weights) playing against each other should get a fifty-fifty result, but our Exact-win Zero wins 61 of the 100 games against Leela Zero. Hence the win rate of our Exact-win Zero is 61%, as shown in Table 5. The result is trustworthy: the margin over 50% is greater than two standard deviations for 100 games. It confirms that our Exact-win-MCTS is very effective at improving the strength of the AlphaZero approach.

In the experiment of Table 5, regardless of which program (Exact-win Zero or Leela Zero) plays Black or White, the win rates of Black and White are 49% and 51% respectively. This makes sense because White has a slight advantage in Go due to the 7.5 komi. Among the 100 games, our Exact-win Zero playing Black (resp. White) gets a 60% (resp. 62%) win rate, as shown in Figure 7. Indeed, our Exact-win Zero outperforms Leela Zero whether it plays Black or White.

Among these 100 games, every game lasted at least 151 moves and at most 416 moves. According to how many moves were played, we classify the games into 4 sets: fewer than 201 moves, 201-250 moves, 251-300 moves, and more than 300 moves, containing 18, 31, 24 and 27 games respectively. Our Exact-win Zero wins 11, 20, 15 and 15 games in these 4 sets respectively. The win rates of our Exact-win Zero vs. Leela Zero for each set are shown in Figure 8. The results show that our Exact-win Zero plays stronger in every set, with a win rate of at least 55.6%. From the algorithmic point of view, our approach avoids re-sampling nodes already proven to be exact wins or losses and can therefore explore more critical nodes in the game tree. Hence Exact-win Zero can see more clearly than Leela Zero in the uncertain opening and middle stages of a game. In the endgame stage, the game result is more certain and both programs are less likely to make wrong moves; even so, our Exact-win Zero still sees more clearly, getting a 55.6% win rate in the set of games with more than 300 moves.

The promising results confirm that our Exact-win Zero defeats Leela Zero with a significant 61% win rate. In conclusion, our Exact-win-MCTS has overcome the AlphaZero approach. It is also possible to apply our method to the training stage of AlphaZero; however, this needs substantial computing resources, such as GPUs, TPUs, and computers, and a lot of time to train the neural network to a top-level weight like AlphaGo Zero's. This deserves to be tried and tested in the future. By using our Exact-win-MCTS in the training stage, we expect the final weights of the neural network to be much better than previous weights produced by top-level organizations such as DeepMind, Leela, Facebook, or Tencent.
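(Sanity check on that significance claim: for 100 games the standard deviation of a fair 50% win rate is √(0.5·0.5/100) = 5%, so 61% is (61−50)/5 ≈ 2.2 standard deviations above even, which is presumably what "greater than 2 times" was meant to say.)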

l1t1 commented 5 years ago

Good, did they publish the SGF files?

alreadydone commented 5 years ago

@zxkyjimmy appears to be one of the authors, and is a contributor to PhoenixGo: https://github.com/Tencent/PhoenixGo/search?q=zxkyjimmy&type=Commits

gcp commented 5 years ago

Does someone have the full paper so we can evaluate the algorithm? The network they chose is almost a year old, so their comparison would have missed any improvements we've made in the last year, but the idea might still work. Needs testing.

Ishinoshita commented 5 years ago

@alreadydone You're piquing our curiosity! You didn't quote any paragraph that explains how it actually works ;-)

But thank you for the efforts to make a digest, until the paper is on arxiv.

roy7 commented 5 years ago

I remember my simple max_lcb change was getting a 55% win rate in most tests when time management was off. I've always been confident that improvements to the search must exist, since MCTS wasn't really designed for NN evals. The question is who will find the better search algorithms and publish them. :)
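(For the curious, here's the gist of max_lcb as a rough sketch, not the actual patch: pick the move to play by the lower confidence bound on its winrate instead of by raw visit count. The names and the normal-approximation bound below are illustrative only.)

```cpp
#include <cmath>
#include <vector>

struct Child {
    double wins = 0.0;   // accumulated value over this child's visits
    int visits = 0;
};

// Return the index of the child with the best lower confidence bound on
// winrate, rather than simply the most-visited child. z = 1.96 is ~95%.
int select_by_max_lcb(const std::vector<Child>& children, double z = 1.96) {
    int best = -1;
    double best_lcb = -1e9;
    for (int i = 0; i < static_cast<int>(children.size()); ++i) {
        const auto& c = children[i];
        if (c.visits < 2) continue;  // too little data for a bound
        const double mean = c.wins / c.visits;
        const double se = std::sqrt(mean * (1.0 - mean) / c.visits);
        const double lcb = mean - z * se;  // lower confidence bound
        if (lcb > best_lcb) { best_lcb = lcb; best = i; }
    }
    return best;  // -1 if no child has enough visits
}
```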

Ttl commented 5 years ago

The algorithm seems to be basically: if a node has a child with a winning evaluation, mark it as losing and don't visit it again. Provably winning or losing nodes in Go only occur after a double pass, so I find it very surprising that this would result in a strength increase.

[Image: exact_win — excerpt of the paper's selection formula]
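In code, my reading of the selection side would be roughly this (an illustrative sketch with made-up names, not the paper's code):

```cpp
#include <vector>

// Game-theoretic label of a node, from the view of the player to move there.
enum class Proof { UNKNOWN, WIN, LOSS, DRAW };

struct Edge {
    Proof proof = Proof::UNKNOWN;
    double puct_score = 0.0;  // the usual Q + U selection score
};

// Pick the next child to explore; children with a proven result are never
// sampled again, so search effort goes only to undecided moves.
const Edge* select_child(const std::vector<Edge>& children) {
    const Edge* best = nullptr;
    for (const auto& e : children) {
        if (e.proof != Proof::UNKNOWN) continue;  // proven: don't revisit
        if (best == nullptr || e.puct_score > best->puct_score) best = &e;
    }
    return best;  // nullptr means every child is proven, so the node is decided
}
```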

Ishinoshita commented 5 years ago

@Ttl, @alreadydone Do you think there is a typo again (like the typo in the sqrt in the KataGo paper ;-) or did they actually change DeepMind's formula? In that case we are not comparing apples to apples.

Edit: I definitely ought to read the paper before commenting ;-). Clearly there are plenty of things I don't grasp from your excerpts: there is no policy term, they are doing full rollouts, etc.

alreadydone commented 5 years ago

I downloaded it from https://dl.acm.org/ft_gateway.cfm?id=3293486&ftid=2040243&dwn=1&#URLTOKEN# which may be behind a paywall. sci-hub exists though.

zxkyjimmy commented 5 years ago

First of all, thanks to everyone who is interested in this paper. I love this community so much.

@alreadydone Yes, I am the author of this paper. As @Ttl said, we mark the nodes whose result is already known and then don't revisit them. Such a change could add overhead to the original MCTS, so we decide whether to update a node's tag by recording the number of unknown children instead of checking the labels of all child nodes.
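A much simplified sketch of that bookkeeping (illustrative names only, not our actual code):

```cpp
// Labels are from the view of the player to move at the node.
enum class Proof { UNKNOWN, WIN, LOSS, DRAW };

struct Node {
    Proof proof = Proof::UNKNOWN;
    int unknown_children = 0;   // set to children.size() at expansion
    bool has_draw_child = false;
    Node* parent = nullptr;
};

// Called once when a child's label changes from UNKNOWN to proven.
// The counter makes each update O(1) instead of a scan over all children.
void on_child_proven(Node* n, Proof child_proof) {
    if (n == nullptr || n->proof != Proof::UNKNOWN) return;
    --n->unknown_children;

    if (child_proof == Proof::LOSS) {
        n->proof = Proof::WIN;          // opponent loses after this move: we win
    } else if (child_proof == Proof::DRAW) {
        n->has_draw_child = true;
    }

    if (n->proof == Proof::UNKNOWN && n->unknown_children == 0) {
        // Every reply is proven and none of them wins for us.
        n->proof = n->has_draw_child ? Proof::DRAW : Proof::LOSS;
    }

    if (n->proof != Proof::UNKNOWN) {
        on_child_proven(n->parent, n->proof);  // proof ripples up the tree
    }
}
```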

I have released 100 game records including SGF files and the complete output logs.

BTW, someone requested the full text of this paper, but as far as I know I don't have permission to distribute it. If you have a subscription to the ACM DL, you can try this link.

Thanks again for your attention.

alreadydone commented 5 years ago

@zxkyjimmy Thanks for chiming in!

The UCB formula as shown deviates quite a bit from the AZ one, and there's no explanation of those changes; if there is no typo, then it's unclear whether the improvement is due to the node marking/pruning or to those changes.
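For reference, AGZ/AZ select the child maximizing Q(s,a) + U(s,a), where U(s,a) = c_puct · P(s,a) · √(Σ_b N(s,b)) / (1 + N(s,a)), with P the policy prior and N the visit counts.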

As @Ttl said, there is no other means of getting a certain result unless a double pass happens in the search tree and the score is counted. (If one-ply look-ahead were implemented, then a single pass would be sufficient to get a certain/marked node about half of the time.)
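Concretely, the only way a search-tree node gets an exact label in Go is something like this (hypothetical `GameState` helpers, not LZ's actual API):

```cpp
enum class Proof { UNKNOWN, WIN, LOSS, DRAW };

// Sketch: the only exactly-valued nodes reachable during search are
// terminal ones, where two consecutive passes end the game and the
// board can simply be scored.
Proof evaluate_terminal(const GameState& state) {
    if (!state.two_consecutive_passes()) return Proof::UNKNOWN;
    const double score = state.final_score_for_player_to_move();  // incl. komi
    if (score > 0.0) return Proof::WIN;
    if (score < 0.0) return Proof::LOSS;
    return Proof::DRAW;  // cannot occur with a 7.5 komi
}
```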

I see in the logs that the pass move is sometimes considered even when the winrate is above 30%. I recall that some LZ 20b nets were net2net'd and maybe some were trained from scratch. I wonder whether they are insufficiently trained, so that they put a high policy on passing, and whether you'd see the same improvement with your algorithm using a current 40b network. https://github.com/leela-zero/leela-zero/issues/2273 may be relevant.

Upon reading https://github.com/LeelaChessZero/lc0/issues/263, it seems that Exact-win and certainty propagation share quite a few overlapping ideas (MCTS-solver is also mentioned as related work in the Exact-win paper). Maybe you'd like to try some of the ideas there.

It looks like the 100 games were played without playout/visit limits, under default time settings. From the logs, the Exact-win engine seems to be a modified version of LZ; would you show us the diff of the source code? Did you really implement C. Simulation (random playouts until the end of the game)?

If anyone is interested, there's also an earlier paper co-authored with the owner of https://github.com/suragnair/alpha-zero-general: https://dl.acm.org/citation.cfm?doid=3278312.3278325 (that paper uses the UCB formula of AGZ).

l1t1 commented 5 years ago

One question: 001-leela.log shows

Using 2 thread(s).
RNG seed: 15332538815659000100
Detecting residual layers...v1...256 channels...20 blocks.

but the paper says you used LZ 130, which is a 15x192 weight: 130 | 2018-05-02 03:06 | 18e6a6c5 | 15x192 | 10710 | 61580 | 7100021

l1t1 commented 5 years ago

My mistake, the paper said they used LZ 161; I confused it with another program (https://github.com/lightvector/KataGo/releases). 161 | 2018-08-04 18:50 | b0841a68 | 20x256 | 11906 | 19270 | 9192534

l1t1 commented 5 years ago

The Big Win Strategy on Multi-Value Network: An Improvement over AlphaZero Approach for 6x6 Othello. DOI: https://doi.org/10.1145/3278312.3278325

gcp commented 5 years ago

> The algorithm seems to be basically: if a node has a child with a winning evaluation, mark it as losing and don't visit it again. Provably winning or losing nodes in Go only occur after a double pass, so I find it very surprising that this would result in a strength increase.

What about setting bounds around the root score, i.e. root score + 20% and root score − 20%, and saying anything above +20% is a win and anything below −20% is a loss? A bit like aspiration windows in alpha-beta search.
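Something like this, as a rough untested sketch:

```cpp
enum class Proof { UNKNOWN, WIN, LOSS, DRAW };

// Treat evals far enough above/below the root eval as "proven", in the
// spirit of an aspiration window in alpha-beta. Margin of 0.20 = 20%.
Proof classify_eval(double eval, double root_eval, double margin = 0.20) {
    if (eval >= root_eval + margin) return Proof::WIN;   // "good enough"
    if (eval <= root_eval - margin) return Proof::LOSS;  // "hopeless"
    return Proof::UNKNOWN;
}
```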

But I wonder how many nodes you'd prove like that in the first place; it doesn't seem like it should be a lot.