glinscott / leela-chess

**MOVED TO https://github.com/LeelaChessZero/leela-chess** A chess adaptation of GCP's Leela Zero
http://lczero.org
GNU General Public License v3.0

Treat UCT search configuration parameters as hyperparameters #606

Open Tilps opened 6 years ago

Tilps commented 6 years ago

Investigation into possible reasons for the cascading failure of value head quality after v0.8 was released suggested the PUCT and fpu-reduction changes as likely causes. If we believe these values can actually affect the net's tendency to overfit in training, we should treat them as hyperparameters and enforce their values in training from the server (even if they happen to match the client defaults).
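A minimal sketch of what server-side enforcement could look like. The field names, endpoint shape, and default values here are hypothetical, not the actual leela-chess client/server protocol; the point is only that the server always sends the search parameters with each self-play task and the client applies them over its local defaults:

```python
# Illustrative sketch only: field names and defaults are hypothetical,
# not the real leela-chess client/server protocol.

DEFAULT_SEARCH_PARAMS = {"puct": 0.85, "fpu_reduction": 0.2}  # local client defaults

def apply_task(task: dict) -> dict:
    """Merge server-mandated search hyperparameters over local defaults.

    The server sends its values even when they happen to match the client
    defaults, so a client built with different defaults still generates
    training games with the intended settings.
    """
    params = dict(DEFAULT_SEARCH_PARAMS)
    params.update(task.get("search_params", {}))
    return params

# Example task as a server might send it for one self-play game:
task = {"network": "abcd1234", "search_params": {"puct": 0.85, "fpu_reduction": 0.2}}
print(apply_task(task))  # -> {'puct': 0.85, 'fpu_reduction': 0.2}
```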

Videodr0me commented 6 years ago

I agree, especially with puct. Since puct adjusts the balance between the policy prior and the propagated value when selecting children, it can have an effect on learning and overfitting. The interactions are complicated, but one argument (just to put it into words) is: at the moment one would rather learn value via the route the policy chooses than vice versa, since the value head does not generalize well. The change from 0.85 to 0.6 could have amplified the current oversampling problems.
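For reference, the child-selection rule under discussion is the AlphaGo Zero-style PUCT score, where c_puct weights the policy prior against the backed-up value. A simplified sketch (made-up numbers, not the exact leela implementation) showing how lowering c_puct from 0.85 to 0.6 can flip the selection from the policy's favourite to the value's favourite:

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c_puct):
    """AlphaGo Zero-style selection score:
    Q(s,a) + c_puct * P(s,a) * sqrt(N(s)) / (1 + N(s,a))."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

# Two candidate moves: one favoured by the policy prior, one by backed-up value.
children = [
    {"q": 0.05, "prior": 0.60, "n": 10},   # policy's favourite
    {"q": 0.30, "prior": 0.05, "n": 40},   # value's favourite
]
parent_visits = 50
for c_puct in (0.85, 0.60):
    scores = [puct_score(c["q"], c["prior"], parent_visits, c["n"], c_puct)
              for c in children]
    # At c_puct=0.85 the policy's favourite scores higher; at 0.60 the
    # value's favourite wins, i.e. the same tree is searched differently.
    print(c_puct, [round(s, 3) for s in scores])
```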

My view on puct is that it is more a parameter related to learning than one that should be tuned "outside" purely for elo gain. In fact I would make a case for not adjusting it during training at all, or only very carefully when improvement stalls over longer periods. Of course these tuned values can still be used outside training for matches (e.g. TCEC) just fine. My rationale is "weak puct irrelevancy": in the long term (bar the minor constraints from the sum-to-one softmax restriction) policy learning will just scale the policy head output roughly to "whatever_leela_likes / whatever_we_set_in_cfg_puct", effectively nullifying our changes to puct. Obviously this adjustment in learning might run into problems due to regularization and the softmax sum-to-one restriction if our configured puct is at odds with whatever direction the net is taking.
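To make the sum-to-one caveat in the "weak puct irrelevancy" argument concrete: to cancel a change of c_puct from 0.85 to 0.60 the policy head would have to scale every prior by 0.85/0.60, but the softmax forces the priors to sum to one, so a uniform rescaling is impossible and the compensation can only come from reshaping relative priors. A small numeric illustration with made-up priors:

```python
import numpy as np

old_c, new_c = 0.85, 0.60
priors = np.array([0.6, 0.3, 0.1])       # made-up policy output, sums to 1

# To keep c_puct * P unchanged for every move, the net would need:
target = priors * old_c / new_c
print(target.sum())                       # ~1.42 -> not a valid softmax output

# What the softmax actually allows is a renormalized version:
compensated = target / target.sum()
print(compensated)                        # identical to the original priors

# A uniform rescaling is exactly cancelled by normalization.  In practice
# most of the mass sits on a few moves, so reshaping the distribution can
# approximate the compensation where it matters, but regularization and the
# sum-to-one constraint keep it imperfect -- which is the caveat above.
```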

Reading the AG0 and A0 papers, it seems that DeepMind also realized this. In the AG0 paper they state:

MCTS search parameters were selected by Gaussian process optimization68, so as to optimize self-play performance of AlphaGo Zero using a neural network trained in a preliminary run. For the larger run (40 blocks, 40 days), MCTS search parameters were re-optimized using the neural network trained in the smaller run (20 blocks, 3 days). The training algorithm was executed autonomously without human intervention.

So it seems they never changed search parameters during training of the 40-block network. All search parameters were adjusted based on results of the fully trained smaller network. In A0 they also never adjusted search parameters:

In AlphaZero we reuse the same hyper-parameters for all games without game-specific tuning. The sole exception is the noise that is added to the prior policy to ensure exploration (29); this is scaled in proportion to the typical number of legal moves for that game type.

Unless otherwise specified, the training and search algorithm and parameters are identical to AlphaGo Zero (29).
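For concreteness, the Gaussian-process tuning described in the AG0 quote above, re-optimizing search parameters against a fixed, already-trained network, could look roughly like the sketch below. The match-playing objective is a toy stand-in (a made-up score surface so the snippet runs end to end), and scikit-optimize is just one convenient GP optimizer, not what DeepMind or leela actually use:

```python
from skopt import gp_minimize          # scikit-optimize
from skopt.space import Real

def play_matches(c_puct, fpu_reduction):
    """Stand-in for real self-play matches with a FIXED, already-trained net.
    Here it is a toy score surface peaking near (0.9, 0.2); a real objective
    would return a match score or Elo estimate from a batch of games."""
    return -((c_puct - 0.9) ** 2 + (fpu_reduction - 0.2) ** 2)

def objective(params):
    c_puct, fpu_reduction = params
    return -play_matches(c_puct, fpu_reduction)   # gp_minimize minimizes

space = [Real(0.3, 2.0, name="c_puct"), Real(0.0, 0.5, name="fpu_reduction")]
result = gp_minimize(objective, space, n_calls=30, random_state=0)
print(result.x)   # tuned (c_puct, fpu_reduction), then frozen for the whole run
```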

So in conclusion I would argue for not adjusting search parameters during training based on "outside of training" self-play elo optimization, in particular puct. Much the same could be said about fpu reduction, albeit to a lesser degree, as I do not know whether leela can really replicate changes to fpu reduction through changes in policy and/or value. But if we have problems in learning we might want to look at shifting puct very gradually, and only if learning is really stagnating and there are no other solutions to the oversampling problems. These gradual changes should also be based more on the requirements of learning (value head overfitting etc.) than on self-play elo gains.
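For readers less familiar with the second parameter: fpu ("first play urgency") reduction controls the value assigned to children that have not been visited yet. A simplified sketch of the idea, not the exact leela formula:

```python
def unvisited_child_value(parent_q, fpu_reduction):
    """First-play urgency with reduction, simplified.

    Instead of treating an unvisited move as maximally promising, it is
    scored slightly worse than the parent's current value estimate, so the
    search widens only when the already-explored moves start to look bad.
    A larger fpu_reduction makes the search narrower and more policy-driven.
    """
    return parent_q - fpu_reduction

def child_value(child_visits, child_q, parent_q, fpu_reduction):
    """Value term used in the PUCT selection score for a child node."""
    if child_visits == 0:
        return unvisited_child_value(parent_q, fpu_reduction)
    return child_q
```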