will-iam closed this issue 6 years ago
Maybe a stupid and useless question, but what's the point of the temperature parameter if we always choose the move with the highest visit count? Moreover, if we instead choose moves randomly according to the policy distribution, shouldn't the temperature decrease with the length of the game? I understand that https://github.com/Zeta36/chess-alpha-zero does this.

Thank you everyone for contributing to this amazing and dynamic project! Can't wait to see it play!
Which temperature parameter do you mean?
1) In AZ, they set the temperature to a fixed 1 for move selection in self-play: the engine chooses moves proportionally to their visit counts (see the sketch after point 2). I don't see why the temperature should decrease with the length of the game. If it's only to ensure divergence (rather than to increase exploration), that would be reasonable, and would match AlphaGo Zero, which originally used τ=1 for the first 30 moves only. But exploration is good!
2) There is a cfg_softmax_temp parameter that acts as an operator on the network outputs. Its main use is to allow some further tuning after the best network has been established. It also interacts with the UCT parameter.
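To make the two parameters concrete, here is a minimal Python sketch (illustrative only; `select_move`, `softmax`, and all values are made up, not Leela's actual code):

```python
import math
import random

def select_move(visit_counts, tau=1.0):
    """Self-play move choice from MCTS visit counts (parameter 1).

    tau=1 samples moves proportionally to their visit counts, as in AZ;
    as tau -> 0 it degenerates to always playing the most-visited move.
    """
    moves = list(visit_counts)
    if tau < 1e-3:  # treat tiny tau as greedy to avoid huge exponents
        return max(moves, key=lambda m: visit_counts[m])
    weights = [visit_counts[m] ** (1.0 / tau) for m in moves]
    return random.choices(moves, weights=weights, k=1)[0]

def softmax(logits, temp=1.0):
    """Softmax with temperature (the role of parameter 2).

    Applied to raw network outputs: temp > 1 flattens the policy,
    temp < 1 sharpens it toward the highest-scoring move.
    """
    scaled = [x / temp for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Example: sample a move from counts after a batch of playouts
# select_move({"e2e4": 420, "d2d4": 270, "g1f3": 110}, tau=1.0)
```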
Thank you, I mixed up the two parameters. Now, referring to the first one: "For the first 30 moves of each game, the temperature is set to τ = 1; this selects moves proportionally to their visit count in MCTS, and ensures a diverse set of positions are encountered. For the remainder of the game, an infinitesimal temperature is used, τ→0." From this I understood that deep in the search, the temperature should decay. Sorry for being a beginner, but what do you mean when you say that it ensures divergence? And why would it be reasonable?
> I understood that deep in the search, the temperature should decay.
This parameter has nothing to do with the search or search depth; it is applied to the final search output. (And it is a constant 1 in AZ, instead of variable as in AlphaGo Zero.)
> what do you mean when you say that it ensures divergence? And why would it be reasonable?
The idea is that generating more self-play games only helps if they are different from each other. In AlphaGo Zero, there was additional randomness from rotating the board randomly, which is not present in chess. If you are only interested in playing different games (rather than also exploring moves the current network considers less good), it is reasonable to apply the randomization only early on: at some point the game will already have diverged.
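As a sketch, "only randomize early on" is just a temperature schedule over move numbers; 30 is the AlphaGo Zero cutoff, and `select_move` is the hypothetical sampler from the earlier sketch:

```python
def temperature_for(move_number, cutoff=30):
    """AlphaGo Zero-style schedule: tau=1 while games should still
    be diverging, then effectively greedy (tau -> 0) for the rest."""
    return 1.0 if move_number < cutoff else 1e-9

# move = select_move(visit_counts, tau=temperature_for(move_number))
```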
Thanks a lot!
The infinitesimal temperature τ→0 refers to a formula in the AlphaGo Zero paper, which sets the move probability (before normalization) to N^(1/τ). For τ=1 this means move probability proportional to visit count; for τ→0 it means greedy selection, i.e. the move with the highest visit count is always selected. τ→0 is a mathematical convention, since you're not allowed to divide by zero.
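Written out, that selection rule from the paper is:

```latex
% Probability of picking move a at the root, given visit counts N(s,a):
\pi(a \mid s) = \frac{N(s,a)^{1/\tau}}{\sum_b N(s,b)^{1/\tau}}
% tau = 1: probability proportional to visit count.
% tau -> 0: all mass on argmax_a N(s,a), i.e. greedy selection.
```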
In the current self-play implementation, every chosen move is the best one, right? How do we ensure divergence then? Is Dirichlet noise enough to avoid producing the same game over and over? Maybe there is another random part in the search, but I don't see it.
In Leela Zero, additional randomisation is provided by applying a random symmetry (rotation/reflection) to the board before network evaluation. That is harder to do in chess, but may be possible; see https://github.com/glinscott/leela-chess/issues/25. If not, a temperature larger than 0 will provide some degree of randomness. AlphaZero actually uses τ=1 for self-play, so there's plenty of divergence there.
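For reference, the Go-side trick is cheap because the board has 8 exact symmetries; a rough numpy sketch of the idea (illustrative, not Leela Zero's actual implementation):

```python
import random
import numpy as np

def random_symmetry(plane):
    """Apply one of the 8 rotations/reflections of a square board.
    Free randomness for Go, since the rotated position is equivalent;
    chess symmetries are broken by castling, pawn direction, etc."""
    plane = np.rot90(plane, random.randrange(4))  # 0-3 quarter turns
    if random.random() < 0.5:                     # optional mirror
        plane = np.fliplr(plane)
    return plane
```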
Thanks, and #28 answers my question too.