mokemokechicken opened 6 years ago
The rules of Go are invariant to rotation and reflection. This fact was exploited in AlphaGo and AlphaGo Zero in two ways. First, training data was augmented by generating 8 symmetries for each position. Second, during MCTS, board positions were transformed using a randomly selected rotation or reflection before being evaluated by the neural network, so that the Monte-Carlo evaluation is averaged over different biases.
Oh..., I didn't generate 8 symmetries for each position...
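For reference, a minimal sketch of generating the 8 symmetries (4 rotations, each optionally reflected) with NumPy; the function name and the assumption that the board is a square 2-D array are mine:

```python
import numpy as np

def board_symmetries(board):
    """Yield the 8 symmetries of a square board:
    4 rotations, each with and without a left-right reflection."""
    for k in range(4):
        rotated = np.rot90(board, k)
        yield rotated
        yield np.fliplr(rotated)
```

Note that the policy target must be transformed with the same rotation/reflection as the board, otherwise the augmented samples become inconsistent.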
Dirichlet noise Dir(α) was added to the prior probabilities in the root node; this was scaled in inverse proportion to the approximate number of legal moves in a typical position, to a value of α = {0.3, 0.15, 0.03} for chess, shogi and Go respectively.
In Reversi, would it be better for α to be around 0.3 ~ 0.5?
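A minimal sketch of the root-noise mixing described in the quote, assuming ε = 0.25 as in the AlphaGo Zero paper; α = 0.5 here is only the value under discussion for Reversi, not a confirmed setting:

```python
import numpy as np

def add_root_noise(priors, alpha=0.5, epsilon=0.25):
    """Mix Dirichlet noise into the root prior probabilities:
    P(s, a) = (1 - eps) * p_a + eps * eta_a, with eta ~ Dir(alpha)."""
    noise = np.random.dirichlet([alpha] * len(priors))
    return (1.0 - epsilon) * np.asarray(priors) + epsilon * noise
```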
Illegal moves are masked out by setting their probabilities to zero, and re-normalising the probabilities for remaining moves.
Re-normalising over legal moves may be important for the balance between value and policy.
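A sketch of that masking step, assuming a NumPy policy vector and a boolean legality mask (both names are illustrative):

```python
import numpy as np

def mask_and_renormalise(policy, legal_mask):
    """Zero out illegal moves and re-normalise over the legal ones.

    policy: raw network output over all moves (e.g. 64 squares + pass).
    legal_mask: boolean array, True where the move is legal.
    """
    masked = np.where(legal_mask, policy, 0.0)
    total = masked.sum()
    if total > 0:
        return masked / total
    # Fallback (an assumption, not from the paper): uniform over
    # legal moves if the network assigns them all zero probability.
    return legal_mask / legal_mask.sum()
```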
In chess, AlphaZero outperformed Stockfish after just 4 hours (300k steps)
Wow!!
In Reversi, would it be better for α to be around 0.3 ~ 0.5?
Agreed. Go 19x19 has about 180 legal actions on average, and in Reversi it may be around 10. Following the new paper's inverse scaling, roughly (180 / 10) × 0.03 ≈ 0.5 seems more reasonable.
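Making that inverse scaling explicit, with Go's ~180 legal moves and α = 0.03 as the reference point (the ~10 moves for Reversi is an estimate):

```python
def scaled_alpha(avg_legal_moves, ref_moves=180.0, ref_alpha=0.03):
    """Scale alpha inversely with the average number of legal moves."""
    return ref_alpha * ref_moves / avg_legal_moves

print(scaled_alpha(10))  # ~0.54 for Reversi, in line with 0.3 ~ 0.5
```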
What is the main difference between AlphaGo Zero and AlphaZero? Is the MCTS architecture the same?
Hi @apollo-time
I think the main differences are as follows.
AlphaZero: the training data is not augmented with the 8 symmetries.
So, MCTS is also used without transforming the board position.
FYI: https://arxiv.org/abs/1712.01815