kblomdahl / dream-go

Artificial go player based on reinforcement and supervised learning
Apache License 2.0

MCTSnet instead of Monte Carlo tree search #32

Closed kblomdahl closed 5 years ago

kblomdahl commented 6 years ago

I suggest that we implement MCTSnet instead of traditional tree search. This should provide an improvement because (i) it allows for richer statistics to be extracted from each terminal game¹, and (ii) it allows us to compute better policy values, by keeping two distinct readout and policy networks².

¹ Traditional tree search only provides a binary win / lose statistic.
² This is a bit complicated to explain, but traditional tree search uses the estimated win rate as an indication of which candidate moves should be explored further, and this is not necessarily a good indication of which moves should be explored. See, for example, first play urgency [1] for further discussion of why separating the two is a good idea.

MCTSnet architecture

TensorFlow training script

To implement MCTSnet in TensorFlow there are two complications:

  1. We need to update the board state during the tree search. This is a problem because we cannot calculate a gradient through the update step. It can be solved by adding a likelihood-ratio (REINFORCE) gradient to the gradient of the loss function (see the sketch after this list):

    [ Σ ∂/∂θ log π(h) ] · loss(x)
  2. We need to implement a tree memory in TensorFlow. This can probably be done using sparse lookups and while loops, but it might be much easier to use eager execution to support the dynamic control flow. Eager execution would, however, reduce performance since it would force us to use a batch size of one.
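
For concreteness, here is a minimal TF 1.x graph-mode sketch of the likelihood-ratio trick in point 1; `theta`, `log_pi` and `readout_loss` are illustrative stand-ins for the real parameters, the collected log π(h) values, and loss(x):

```python
import tensorflow as tf

# Stand-ins: `theta` for the trainable parameters, `log_pi` for the
# log π(h) collected during a probe, `readout_loss` for loss(x).
theta = tf.Variable([0.5, -0.25], name='theta')
log_pi = tf.nn.log_softmax(theta)                      # pretend log π(h)
readout_loss = tf.reduce_sum(tf.square(theta - 1.0))   # pretend loss(x)

# Differentiating this surrogate yields
#     ∂/∂θ loss(x) + [ Σ ∂/∂θ log π(h) ] · loss(x)
# because stop_gradient() keeps loss(x) constant inside the REINFORCE term.
surrogate = readout_loss + tf.stop_gradient(readout_loss) * tf.reduce_sum(log_pi)
gradients = tf.gradients(surrogate, [theta])
```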

This gives the following mock-up of the training script (a sketch in code follows the list), where steps 1 and 2 can be batched using background threads:

  1. Create a dataset that yields the triplet (state, logits, winner).
  2. Map each entry in the dataset to 25 entries, one for each iteration of the MCTSnet:
    • Probe into the search tree, keeping track of each ∂/∂θ log π(h) encountered.
    • Calculate the loss of the readout network, and its gradient.
    • Calculate the REINFORCE gradient, where R is the difference between the final loss and the loss of the current step: [ Σ ∂/∂θ ρ(...) ] + [ Σ ∂/∂θ log π(h) ] · R
  3. Batch, and apply the average of the gradients to the parameters.
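
A rough Python sketch of this loop under eager execution; it folds the 25 iterations into a single training step instead of mapping each example to 25 entries, and `probe_and_readout` and `theta` are hypothetical stand-ins for one probe of the search tree plus readout and for the network parameters. Step 3 would wrap `train_step` in a loop over a batched tf.data.Dataset of (state, logits, winner) triplets:

```python
import tensorflow as tf

NUM_PROBES = 25  # MCTSnet iterations per training example

# Stand-in parameter so the sketch has something to differentiate; in the
# real script this would be the embedding / policy / backup / readout weights.
theta = tf.Variable(tf.zeros([362]))

def probe_and_readout(state, logits, winner):
    """Hypothetical stand-in for one probe of the search tree plus readout.

    Returns the readout loss of this step and Σ log π(h) collected while
    descending the tree.
    """
    readout = tf.nn.log_softmax(theta)
    loss = -tf.reduce_sum(logits * readout)   # pretend readout loss
    sum_log_pi = tf.reduce_max(readout)       # pretend Σ log π(h)
    return loss, sum_log_pi

def train_step(optimizer, state, logits, winner):
    """Process one (state, logits, winner) triplet; assumes eager execution."""
    with tf.GradientTape() as tape:
        losses, log_pis = [], []

        for _ in range(NUM_PROBES):
            loss, sum_log_pi = probe_and_readout(state, logits, winner)
            losses.append(loss)
            log_pis.append(sum_log_pi)

        # Total loss is the readout loss plus the REINFORCE term, where
        # R = final loss - loss of the current step (held constant).
        final_loss = losses[-1]
        total = tf.add_n([
            loss + log_pi * tf.stop_gradient(final_loss - loss)
            for loss, log_pi in zip(losses, log_pis)
        ])

    gradients = tape.gradient(total, [theta])
    optimizer.apply_gradients(zip(gradients, [theta]))
```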

Network architectures

Backup (β)

The backup network is responsible for combining the statistics of a node in the search tree into its parent's statistics. The architecture is a gated recurrent unit (GRU):
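
A minimal sketch of this backup step, assuming the [2, 361] node statistics are flattened into a 722-wide vector; the names and sizes are illustrative:

```python
import tensorflow as tf

STAT_SIZE = 2 * 361  # flattened [2, 361] node statistics

# The parent's statistics act as the recurrent state and the child's
# statistics as the input, so each backup updates the parent in place.
backup_cell = tf.keras.layers.GRUCell(STAT_SIZE)

def backup(parent_stats, child_stats):
    """Combine a child's statistics into its parent's statistics."""
    new_parent, _ = backup_cell(child_stats, [parent_stats])
    return new_parent

# Usage: a single node, statistics flattened to shape [1, 722].
parent = tf.zeros([1, STAT_SIZE])
child = tf.random.normal([1, STAT_SIZE])
parent = backup(parent, child)
```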

Learned simulation policy (π)

The simulation policy network is responsible for deciding which moves to explore during each probe of the search tree, given the node statistics. The architecture is the same as the policy head in the AlphaZero article.
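
For reference, a sketch of an AlphaZero-style policy head as described in the paper: a 1×1 convolution with two filters, batch normalization, a ReLU, and a fully connected layer producing 362 logits (361 intersections plus pass). The function name and input shape are illustrative:

```python
import tensorflow as tf

def policy_head(x):
    """AlphaZero-style policy head over a [batch, 19, 19, C] input."""
    x = tf.keras.layers.Conv2D(2, kernel_size=1, use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Flatten()(x)
    return tf.keras.layers.Dense(362)(x)  # logits over all moves, including pass
```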

Embedding (ε)

This network is responsible for summarizing a board state into a set of statistics with shape [2, 361]. The architecture is the same as the residual tower in the AlphaZero article.
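
A sketch of the embedding under these assumptions; the block count, filter count, and the final 1×1 convolution used to reach the [2, 361] shape are illustrative:

```python
import tensorflow as tf

def residual_block(x, filters=256):
    """One AlphaZero-style residual block: conv-BN-ReLU-conv-BN, skip, ReLU."""
    y = tf.keras.layers.Conv2D(filters, 3, padding='same', use_bias=False)(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.ReLU()(y)
    y = tf.keras.layers.Conv2D(filters, 3, padding='same', use_bias=False)(y)
    y = tf.keras.layers.BatchNormalization()(y)
    return tf.keras.layers.ReLU()(tf.keras.layers.Add()([x, y]))

def embedding(board_features, num_blocks=6, filters=256):
    """Summarize [batch, 19, 19, C] board features into [batch, 2, 361] statistics."""
    x = tf.keras.layers.Conv2D(filters, 3, padding='same', use_bias=False)(board_features)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    for _ in range(num_blocks):
        x = residual_block(x, filters)
    x = tf.keras.layers.Conv2D(2, kernel_size=1)(x)  # reduce to two feature planes
    x = tf.keras.layers.Permute([3, 1, 2])(x)        # [batch, 2, 19, 19]
    return tf.keras.layers.Reshape([2, 361])(x)      # node statistics, shape [2, 361]
```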

Readout (ρ)

The readout network is responsible for determining which moves to actually play, and for estimating the win rate, given the node statistics. The architecture is equivalent to the policy head and the value head in the AlphaZero article.
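
A sketch of the readout under the same assumptions, reusing the `policy_head` sketched above and adding an AlphaZero-style value head; the reshape of the [2, 361] statistics back into two 19×19 planes is an assumption about how they are consumed:

```python
import tensorflow as tf

def value_head(x):
    """AlphaZero-style value head producing a win rate estimate in [-1, 1]."""
    x = tf.keras.layers.Conv2D(1, kernel_size=1, use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(256, activation='relu')(x)
    return tf.keras.layers.Dense(1, activation='tanh')(x)

def readout(node_statistics):
    """Map [batch, 2, 361] node statistics to (move logits, win rate)."""
    planes = tf.keras.layers.Reshape([2, 19, 19])(node_statistics)
    planes = tf.keras.layers.Permute([2, 3, 1])(planes)  # back to NHWC
    return policy_head(planes), value_head(planes)
```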

TODO

Citations

[1] "Modifications of UCT and sequence-like simulations for Monte-Carlo Go", Yizao Wang and Sylvain Gelly, http://dept.stat.lsa.umich.edu/~yizwang/publications/wang07modifications.pdf [2] "Gradient Estimation Using Stochastic Computation Graphs", John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel, https://arxiv.org/pdf/1506.05254.pdf

kblomdahl commented 6 years ago

One open question is what to do with the value head of the current network. It is no longer strictly necessary, since it was previously used mainly by the MCTS, which now works from the node statistics instead.

However, it was used to determine whether to surrender, and it is very useful feedback for human operators. There is also the (unproven) claim from DeepMind that it acts as regularization and helps stabilize the internal representation, so we think we should keep it even if it is not strictly speaking necessary.

kblomdahl commented 5 years ago

Using the backup network suggested in the original post does not seem to work very well. It experiences exploding values as the number of rollouts increases, which in turn leads to numerical instabilities. Replacing the suggested backup (β) architecture with a GRU [1] seems to work much better and stabilizes the training (even if it converges much more slowly than the initial suggestion). The choice of a GRU is motivated by the existence of the cudnnRNNForwardInference function in cuDNN [2], which should allow for a very efficient implementation.

Using a GRU for the backup (β) network also implies that we need to change the last activation of the embedding (ε) to a hyperbolic tangent, to ensure the node statistics have a consistent value range.
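
A minimal sketch of the value-range point; layer sizes are illustrative:

```python
import tensorflow as tf

# Final embedding layer with a tanh activation so fresh node statistics lie
# in [-1, 1] ...
embedding_head = tf.keras.layers.Conv2D(2, kernel_size=1, activation='tanh')

# ... matching the backup (β) GRU, whose state is gated through a tanh and
# therefore stays in the same [-1, 1] range.
backup_cell = tf.keras.layers.GRUCell(2 * 361)
```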


[1] "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation", Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio, https://arxiv.org/abs/1406.1078 [2] NVIDIA CUDA Deep Neural Network (cuDNN) Developer Guide