suragnair / alpha-zero-general

A clean implementation based on AlphaZero for any game in any framework + tutorial + Othello/Gobang/TicTacToe/Connect4 and more
MIT License

updateThreshold #226

Open Vovak1919 opened 3 years ago

Vovak1919 commented 3 years ago

I started studying the alpha-zero-general algorithm and found this parameter in main.py:

'updateThreshold': 0.6, # During arena playoff, new neural net will be accepted if threshold or more of games are won.

And this is from Coach.py:

def learn(self):
    """
    Performs numIters iterations with numEps episodes of self-play in each
    iteration. After every iteration, it retrains the neural network with
    examples in trainExamples (which has a maximum length of maxlenofQueue).
    It then pits the new neural network against the old one and accepts it
    only if it wins >= updateThreshold fraction of games.
    """
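Condensed, the acceptance step that docstring describes looks roughly like this (a sketch with simplified names, not the exact Coach.learn() code; Arena, MCTS, the nnet wrapper and the args fields are the ones from the repo):

import numpy as np

def arena_gate(game, nnet, prev_mcts, new_mcts, args):
    # AlphaGo-Zero-style gate: the new net is adopted only if it wins at least
    # updateThreshold fraction of the arena games against the previous net.
    arena = Arena(lambda x: np.argmax(prev_mcts.getActionProb(x, temp=0)),
                  lambda x: np.argmax(new_mcts.getActionProb(x, temp=0)), game)
    prev_wins, new_wins, draws = arena.playGames(args.arenaCompare)

    if prev_wins + new_wins == 0 or new_wins / (prev_wins + new_wins) < args.updateThreshold:
        # Rejected: roll back to the previous weights; self-play keeps using the old net.
        nnet.load_checkpoint(folder=args.checkpoint, filename='temp.pth.tar')
        return False
    # Accepted: this net becomes the new "best" and generates subsequent self-play games.
    nnet.save_checkpoint(folder=args.checkpoint, filename='best.pth.tar')
    return True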

Are there any discrepancies with the original description in "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm"?

In AlphaGo Zero, self-play games were generated by the best player from all previous iterations. After each iteration of training, the performance of the new player was measured against the best player; if it won by a margin of 55% then it replaced the best player and self-play games were subsequently generated by this new player. In contrast, AlphaZero simply maintains a single neural network that is updated continually, rather than waiting for an iteration to complete.

Did I understand correctly that this is the AlphaGo Zero algorithm, but not AlphaZero?
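If I read the AlphaZero description correctly, the loop would instead look more like this (a rough, hypothetical sketch; run_self_play is an illustrative name, not a function in the repo):

def learn_alphazero_style(coach):
    # Hypothetical AlphaZero-style variant: no arena, no acceptance gate.
    for i in range(coach.args.numIters):
        examples = coach.run_self_play()  # illustrative: self-play always uses the current net
        coach.nnet.train(examples)        # the single network is updated in place
        # No pitting against a previous "best"; the updated net generates the
        # next batch of self-play games regardless of its arena performance.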

suragnair commented 3 years ago

Yep you are correct! Check #137 #74. This repo was actually based on AlphaGo Zero.

Vovak1919 commented 3 years ago

Yep you are correct! Check #137 #74. This repo was actually based on AlphaGo Zero.

Thanks! What do you say about this difference between AGZ and AZ?

AlphaGo Zero tuned the hyper-parameter of its search by Bayesian optimisation. In AlphaZero we reuse the same hyper-parameters for all games without game-specific tuning. The sole exception is the noise that is added to the prior policy to ensure exploration (29); this is scaled in proportion to the typical number of legal moves for that game type.
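For reference, the noise that quote mentions is the Dirichlet noise mixed into the root priors. A minimal sketch (the per-game alpha values and the 0.25 mixing fraction are the ones reported in the AlphaZero paper; the helper name is illustrative):

import numpy as np

def add_root_dirichlet_noise(priors, alpha, eps=0.25, rng=np.random.default_rng()):
    # Mix Dirichlet noise into the root prior policy: P = (1 - eps) * p + eps * noise.
    noise = rng.dirichlet([alpha] * len(priors))
    return (1 - eps) * np.asarray(priors) + eps * noise

# Per-game alpha, roughly inverse to the typical number of legal moves:
# chess 0.3, shogi 0.15, Go 0.03.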

Vovak1919 commented 3 years ago

Another question, if this topic is still relevant to you: what do you say about the file pseudocode.py from the Supplementary Materials?
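For context, the relevant part of that pseudocode is the self-play loop, which always pulls the latest network from shared storage and has no evaluation gate. A rough paraphrase of its structure (from memory, not verbatim):

def run_selfplay(config, storage, replay_buffer):
    # Each self-play actor always grabs the latest network from shared storage;
    # there is no arena deciding which network is allowed to generate games.
    while True:
        network = storage.latest_network()
        game = play_game(config, network)
        replay_buffer.save_game(game)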