AlphaGo Fan and AlphaGo Lee use two neural networks: a policy network that outputs
move probabilities and a value network that outputs a position evaluation.
Both networks are combined with Monte Carlo Tree Search (MCTS): the policy network narrows the search to promising moves, and leaf positions are evaluated by mixing the value network's output with the outcome of a fast rollout.
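As a concrete illustration of that leaf evaluation, a minimal sketch; lam = 0.5 is the mixing constant reported in the original AlphaGo paper, and the function name is my own, not from any released code:

```python
# Hedged sketch: AlphaGo Fan/Lee evaluate an MCTS leaf by mixing the value
# network's estimate with the outcome of a fast rollout.
def mixed_leaf_value(v_net: float, z_rollout: float, lam: float = 0.5) -> float:
    """Blend the value network's estimate v_net (in [-1, 1]) with the
    rollout outcome z_rollout (+1 win, -1 loss for the current player)."""
    return (1.0 - lam) * v_net + lam * z_rollout

# Example: the value net says +0.3 but the fast rollout ends in a loss.
print(mixed_leaf_value(0.3, -1.0))  # -0.35
```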
[ ] Pre-train the networks: the policy network is first trained by supervised learning on human expert games, then improved by self-play reinforcement learning; the value network is trained on positions from those self-play games.
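A minimal sketch of that supervised step, assuming PyTorch; the tiny conv net and the random batch are placeholders for the paper's deep network and the human game data:

```python
import torch
import torch.nn as nn

# Supervised pre-training step: maximize the likelihood of the human
# expert's move via cross-entropy over board points.
BOARD = 19
policy_net = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * BOARD * BOARD, BOARD * BOARD),  # one logit per board point
)
opt = torch.optim.SGD(policy_net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

boards = torch.randn(8, 3, BOARD, BOARD)              # stand-in for encoded positions
expert_moves = torch.randint(0, BOARD * BOARD, (8,))  # stand-in for expert move labels

loss = loss_fn(policy_net(boards), expert_moves)  # cross-entropy vs. the human move
opt.zero_grad()
loss.backward()
opt.step()
```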
Differences between AlphaGo Zero and AlphaGo Lee
First, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data.
Second, it uses only the black and white stones from the board as input features.
Third, it uses a single neural network, rather than separate policy and value networks.
Fourth, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte Carlo rollouts.
AlphaGo Zero merges the two networks of AlphaGo Lee into a single network with two outputs: a probability for each move, and a scalar value estimating the probability of the current player winning from the current position (see the dual-head sketch below).
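A minimal sketch of such a dual-head network, assuming PyTorch; the one-layer trunk stands in for the paper's residual tower, and the two input planes (black stones, white stones) omit the history planes the paper also stacks:

```python
import torch
import torch.nn as nn

BOARD = 19

class DualHeadNet(nn.Module):
    """Shared trunk feeding a policy head and a value head."""
    def __init__(self):
        super().__init__()
        # Shared trunk; input planes encode only the black and white stones
        self.trunk = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Policy head: one logit per board point, plus one for "pass"
        self.policy = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * BOARD * BOARD, BOARD * BOARD + 1),
        )
        # Value head: scalar in [-1, 1] via tanh
        self.value = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * BOARD * BOARD, 1), nn.Tanh(),
        )

    def forward(self, x):
        h = self.trunk(x)
        return self.policy(h), self.value(h)

net = DualHeadNet()
p_logits, v = net(torch.randn(1, 2, BOARD, BOARD))
print(p_logits.shape, v.item())  # torch.Size([1, 362]) and a value in [-1, 1]
```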
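And a sketch of how that single network drives the simpler search: one forward pass per leaf supplies both the move priors and the backed-up value, with no rollout played. It reuses the hypothetical DualHeadNet/net from the sketch above:

```python
import torch

def evaluate_leaf_zero(board_planes, net):
    """AlphaGo Zero style leaf evaluation: one network call replaces both the
    fast rollout and the separate value network of AlphaGo Lee."""
    with torch.no_grad():
        p_logits, v = net(board_planes)
    priors = torch.softmax(p_logits, dim=-1)  # prior probabilities for new edges
    return priors, v.item()                   # both are backed up the tree

# Usage with the DualHeadNet sketch above:
priors, value = evaluate_leaf_zero(torch.randn(1, 2, 19, 19), net)
```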