zhengrongpeter opened this issue 3 years ago
The alpha-zero agent can learn simple games such as TicTacToe very quickly, to the point where it has solved them after only a few iterations. In the case of TicTacToe, I found that the agent learns that, against an optimal opponent, it can never win. As a result it only plays moves that lead to a draw and doesn't seem to try to win. One solution, in games where the first or second player is guaranteed a win with optimal play (such as Connect4), is to introduce a decay in the reward, so that the longer the agent avoids losing, the higher it scores (see the sketch below). Another solution, in the case of games guaranteed to end in a draw, is to train the agent against a sub-optimal or even random player. However, this would be a divergence from the Alpha-Zero paper.
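Something like the following is what I have in mind for the decay. This is only a minimal sketch: it assumes self-play examples are `(state, policy_target, value_target)` tuples and that we know which player moved at each step; the function name and the decay factor are made up, not taken from this repo.

```python
def assign_decayed_values(trajectory, winner, decay=0.95):
    """Turn one finished self-play game into training examples with decayed outcomes.

    trajectory: list of (state, policy_target, player_to_move) in move order.
    winner: +1 if player 1 won, -1 if player 2 won, 0 for a draw.
    decay: per-move discount applied to the terminal reward, so an outcome that is
           many moves away contributes a weaker signal; a loss the agent manages to
           postpone therefore scores less badly at the earlier positions.
    """
    examples = []
    n = len(trajectory)
    for t, (state, policy_target, player) in enumerate(trajectory):
        if winner == 0:
            value = 0.0
        else:
            outcome = 1.0 if player == winner else -1.0  # from the mover's perspective
            value = outcome * (decay ** (n - 1 - t))     # discount by moves remaining
        examples.append((state, policy_target, value))
    return examples
```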
Yeah, the predicted state value for the first player is really high, while the value for the second player is really low, which basically says the second player cannot win if the first player plays perfectly. Alpha-zero might decide the game is too lopsided and that there is nothing more to learn... Also, after many iterations, alpha-zero self-play on TicTacToe draws almost every game, so once the earlier training data are removed from the buffer, the model might forget the beginner-level games. I think it does matter when a beginner moves first and alpha-zero plays second, because the second player has only ever trained against an expert who always opens with the same sequence of moves... I think playing against a random player is a simple but valid solution (see the sketch below); for Go or Chess it probably won't be a problem.
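For the random-player idea, something along these lines could work: just have a fraction of the games in each iteration played against a uniformly random opponent. Note that `play_game` and `mcts_policy` below are placeholders for whatever this repo actually uses; this is only a sketch, not its real API.

```python
import random

def random_policy(legal_moves):
    """Uniformly random opponent: ignores the position and picks any legal move."""
    return random.choice(legal_moves)

def generate_games(num_games, mcts_policy, play_game, random_fraction=0.25):
    """Generate training games, mixing in some games against a random opponent.

    play_game(p1_policy, p2_policy) is assumed to return the list of training
    examples produced by one finished game.
    """
    examples = []
    for _ in range(num_games):
        if random.random() < random_fraction:
            # Randomize which side the weak opponent plays so both colors
            # get to see off-policy openings.
            if random.random() < 0.5:
                examples += play_game(random_policy, mcts_policy)
            else:
                examples += play_game(mcts_policy, random_policy)
        else:
            examples += play_game(mcts_policy, mcts_policy)
    return examples
```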
Yes, maybe a good training method for TicTacToe would be to play fewer games per iteration (maybe around 20?), since there are far fewer possible states. I don't think it would take more than 20 iterations for the model to solve the game even with that little data, so the model would still be training on "beginner" data.
How about adding a little bit of noise to the first few moves of a game to randomize each opening? That might teach it to handle all kinds of openings, not just the optimal ones.
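One simple way to do that (similar in spirit to the temperature-based move selection in the AlphaGo Zero paper) is to sample the first few moves from the MCTS visit counts instead of always playing the most-visited move. The names and defaults below are illustrative, not taken from this repo.

```python
import numpy as np

def select_move(visit_counts, move_number, opening_moves=2, temperature=1.0):
    """Pick a move index from MCTS visit counts.

    For the first `opening_moves` plies, sample proportionally to
    visit_counts ** (1 / temperature) so openings get randomized;
    afterwards play greedily (most-visited move).
    """
    counts = np.asarray(visit_counts, dtype=np.float64)
    if move_number < opening_moves:
        probs = counts ** (1.0 / temperature)
        probs /= probs.sum()
        return int(np.random.choice(len(counts), p=probs))
    return int(np.argmax(counts))
```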
What I mean is that after a few iterations of self-play and training, I think MCTS will converge and always place the first piece in the center of the 3x3 board (starting from the center has the highest probability of winning). From then on, the games generated by self-play will always start from the center. Since the early data will be removed from the buffer (due to the maximum size limit), the dataset used for training will contain only expert-level games. And since the neural network may forget early patterns once that data is removed, I think the later models will forget how to deal with a beginner who doesn't put the first piece in the center. If I place the first piece in a corner and alpha-zero is the second player, will alpha-zero still never lose?
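To illustrate the eviction problem and one possible mitigation: with a plain FIFO buffer, the beginner-level games from early iterations disappear once the buffer fills with converged self-play; keeping a small reservoir of older positions and mixing a few of them into each batch is one cheap workaround. The class below is only a sketch with made-up names and sizes, not code from this repository.

```python
from collections import deque
import random

class ReplayBuffer:
    def __init__(self, max_size=10000, archive_size=1000, archive_prob=0.05):
        self.recent = deque(maxlen=max_size)  # newest self-play examples (FIFO, evicts oldest)
        self.archive = []                     # small reservoir of older examples
        self.archive_size = archive_size
        self.archive_prob = archive_prob

    def add(self, example):
        # Occasionally copy an example into the archive before it can be evicted.
        if random.random() < self.archive_prob:
            if len(self.archive) < self.archive_size:
                self.archive.append(example)
            else:
                self.archive[random.randrange(self.archive_size)] = example
        self.recent.append(example)

    def sample(self, batch_size, old_fraction=0.1):
        # Mix a small fraction of archived (older) positions into every batch.
        n_old = min(int(batch_size * old_fraction), len(self.archive))
        batch = random.sample(self.archive, n_old) if n_old else []
        recent_pool = list(self.recent)
        batch += random.sample(recent_pool, min(batch_size - n_old, len(recent_pool)))
        return batch
```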