Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

mokemokechicken / reversi-alpha-zero

Reversi reinforcement learning by AlphaGo Zero methods.

MIT License


Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm #13

Open mokemokechicken opened 6 years ago

mokemokechicken commented 6 years ago

FYI: https://arxiv.org/abs/1712.01815

mokemokechicken commented 6 years ago

The rules of Go are invariant to rotation and reflection. This fact was exploited in AlphaGo and AlphaGo Zero in two ways. First, training data was augmented by generating 8 symmetries for each position. Second, during MCTS, board positions were transformed using a randomly selected rotation or reflection before being evaluated by the neural network, so that the Monte-Carlo evaluation is averaged over different biases.

Oh... I didn't generate the 8 symmetries for each position.
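
For reference, a minimal sketch of the augmentation described in the quote, assuming the board and the policy target are both stored as N×N NumPy arrays. The function name `symmetries` and the array layout are hypothetical and not taken from this repo; a pass move, if the policy has one, would need separate handling.

```python
import numpy as np

def symmetries(board, policy):
    """Return the 8 rotations/reflections of a (board, policy) pair.

    board:  (N, N) array describing the position (assumed layout).
    policy: (N, N) array of move probabilities, one per square (assumed layout).
    """
    out = []
    for k in range(4):  # rotate by 0, 90, 180 and 270 degrees
        b, p = np.rot90(board, k), np.rot90(policy, k)
        out.append((b, p))
        # mirror each rotation; board and policy must get the same transform
        out.append((np.fliplr(b), np.fliplr(p)))
    return out
```

Each training position would then contribute up to 8 samples, all sharing the same value target. Symmetric positions (such as the Reversi opening) yield duplicates, which is harmless.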

mokemokechicken commented 6 years ago

Dirichlet noise Dir(α) was added to the prior probabilities in the root node; this was scaled in inverse proportion to the approximate number of legal moves in a typical position, to a value of α = {0.3, 0.15, 0.03} for chess, shogi and Go respectively.

In Reversi, would α be better at around 0.3 to 0.5?
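
A minimal sketch of how such root noise could be mixed in, assuming the priors are given over the legal moves at the root only. The mixing weight ε = 0.25 is the value reported in the AlphaGo Zero paper and is an assumption here; `add_root_noise` is a hypothetical name, not a function from this repo.

```python
import numpy as np

def add_root_noise(priors, alpha=0.3, epsilon=0.25, rng=None):
    """Mix Dirichlet noise into the root priors: (1 - eps) * p + eps * Dir(alpha).

    priors:  1-D array of prior probabilities over the legal root moves.
    alpha:   Dirichlet concentration (0.3 is the chess value quoted above).
    epsilon: mixing weight (assumed 0.25, as in AlphaGo Zero).
    """
    rng = np.random.default_rng() if rng is None else rng
    priors = np.asarray(priors, dtype=float)
    noise = rng.dirichlet([alpha] * len(priors))  # one sample, sums to 1
    return (1 - epsilon) * priors + epsilon * noise
```

Because both the priors and the noise sum to 1, the mixed result is still a probability distribution. A smaller α concentrates the noise on fewer moves, which is why it is scaled down for games with many legal moves.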

mokemokechicken commented 6 years ago

Illegal moves are masked out by setting their probabilities to zero, and re-normalising the probabilities for remaining moves.

Re-normalising over the legal moves may be important because it affects the balance between value and policy.
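
A minimal sketch of the masking step in the quote, assuming the network output is a flat probability vector and the legal moves are given as a boolean mask of the same length. The name `mask_and_renormalise` and the uniform fallback are assumptions for illustration, not part of the paper or this repo.

```python
import numpy as np

def mask_and_renormalise(policy, legal_mask):
    """Zero out illegal moves and rescale the rest so they sum to 1.

    policy:     1-D array of move probabilities from the network.
    legal_mask: 1-D boolean array, True where the move is legal
                (assumed to contain at least one legal move).
    """
    policy = np.asarray(policy, dtype=float)
    legal_mask = np.asarray(legal_mask, dtype=bool)
    masked = np.where(legal_mask, policy, 0.0)
    total = masked.sum()
    if total <= 0:
        # Assumption: if the network put no mass on any legal move,
        # fall back to a uniform distribution over the legal moves.
        return legal_mask / legal_mask.sum()
    return masked / total
```

For example, a policy of `[0.5, 0.3, 0.2]` with the middle move illegal becomes `[0.5/0.7, 0, 0.2/0.7]`, so the priors used in the search again sum to 1.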

Zeta36 commented 6 years ago

In chess, AlphaZero outperformed Stockfish after just 4 hours (300k steps)

Wow!!

gooooloo commented 6 years ago

In Reversi, would α be better at around 0.3 to 0.5?

Agreed. Let's say there are about 180 legal actions on average in 19x19 Go, and perhaps around 10 in Reversi. Going by the new paper, 10 times 0.03 (that is, 0.3) seems more reasonable.
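
A quick check of that scaling, assuming strict inverse proportion anchored at the Go value and the rough move counts guessed above (180 for Go, 10 for Reversi). The paper only says "approximate", so `scaled_alpha` is a hypothetical helper and not the authors' formula.

```python
def scaled_alpha(avg_legal_moves, ref_alpha=0.03, ref_legal_moves=180):
    """Scale alpha in inverse proportion to the typical number of legal moves.

    ref_alpha / ref_legal_moves: the Go value from the paper and the
    move count guessed in this thread (both treated as assumptions).
    """
    return ref_alpha * ref_legal_moves / avg_legal_moves

reversi_alpha = scaled_alpha(10)  # roughly 0.54 under these assumptions
```

Under these assumptions Reversi comes out near 0.54, somewhat above 10 × 0.03 = 0.3, so the 0.3 to 0.5 range looks like a sensible place to start experimenting.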

apollo-time commented 6 years ago

What is the main difference between AlphaGo Zero and AlphaZero? Is the MCTS architecture the same?

mokemokechicken commented 6 years ago

Hi @apollo-time

I think the main differences are as follows.

(See pp. 3-4 of the paper.)

AlphaZero:

AlphaZero does not augment the training data and does not transform the board position during MCTS (for generality).
The evaluation step is omitted; self-play is always performed with the newest model parameters. (!)
Hyper-parameters were not tuned by Bayesian optimization; the earlier parameters are reused, except for the policy noise.

So, MCTS is also used without transforming the board position.
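
The second point can be sketched as follows. This is only an illustration: `model_for_self_play` and its arguments are hypothetical names, and the 55% win-rate threshold is the one reported for AlphaGo Zero.

```python
def model_for_self_play(best, latest, latest_win_rate=None,
                        use_evaluation=True, threshold=0.55):
    """Pick which model generates the next batch of self-play games.

    use_evaluation=True  : AlphaGo Zero style, the latest model replaces
                           the best one only if it beat it in evaluation
                           games by more than `threshold`.
    use_evaluation=False : AlphaZero style, always use the latest model.
    """
    if not use_evaluation:
        return latest
    if latest_win_rate is not None and latest_win_rate > threshold:
        return latest
    return best
```

With `use_evaluation=False` the evaluation games disappear entirely, so every training update immediately affects the self-play data.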