Closed sethtroisi closed 5 years ago
Nice analysis and summary. But I am confused why anyone would use batch size and virtual loss to refer to the same thing; to me, virtual loss is a certain modification applied to the action values (Q or U, or both) of the nodes along the path to a leaf node that is being evaluated (to discourage the search from descending into the same leaf node), and the size of the virtual loss usually determines the magnitude of that modification.
For batching with LZ in matches (with very high playouts), as opposed to self-play, I use a vloss accumulation strategy to avoid these evaluating nodes; see my (alreadydone/lz) branches with the word accum in their names. The idea is to keep the virtual loss along the path if you hit the same leaf node again and again. However, this is CPU-intensive, so I am leaning toward combining it with an earlier, also natural approach: expand the next sibling of the leaf node once the leaf has accumulated a certain number of vlosses. I've thought about keeping a concurrent priority queue keyed on action value, but the action values would ideally need to be updated after each backup, and many intermediate nodes' action values also have to be taken into account (the search chooses one child at each of those nodes), yet there are no natural probabilities derived from action values to take the product of. I also haven't thought through how to avoid excessive thread contention.
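The accumulation idea above can be sketched roughly as follows. This is a hypothetical illustration, not the code from the accum branches; the `Node` class, `handle_leaf` helper, and `ACCUM_THRESHOLD` constant are all assumed names:

```python
# Hypothetical sketch of "vloss accumulation": if selection lands on a leaf
# that is already being evaluated, leave the virtual loss on the path (the
# leaf is not queued again); once the leaf has accumulated enough virtual
# losses, fall back to expanding its next-best sibling instead.

class Node:
    def __init__(self, prior=0.0):
        self.children = {}        # move -> Node
        self.prior = prior
        self.virtual_losses = 0

ACCUM_THRESHOLD = 4  # assumed tuning parameter

def handle_leaf(parent, leaf, in_flight):
    """Return the node to actually queue for NN evaluation (or None)."""
    if leaf not in in_flight:
        in_flight.add(leaf)
        return leaf                    # normal case: queue this leaf
    leaf.virtual_losses += 1           # accumulate; vloss stays on the path
    if leaf.virtual_losses >= ACCUM_THRESHOLD and parent is not None:
        # Leaf is saturated: queue the best sibling not yet in flight.
        siblings = sorted(parent.children.values(),
                          key=lambda c: c.prior, reverse=True)
        for sib in siblings:
            if sib is not leaf and sib not in in_flight:
                in_flight.add(sib)
                return sib
    return None                        # nothing new to evaluate this pass
```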
To avoid a node that shouldn't (at the moment) be selected affecting its ancestors' evals, I introduced a "fractional backup" strategy: based on current winrates, calculate how many more visits are needed before the node would be selected, and weight the node's eval so that the proportion of its contribution is the same as at the moment it would be selected in non-parallel MCTS (i.e. if the node currently has x visits and would be selected after y visits, weight its eval by x/y). This is only a rough approximation, especially at large batch sizes: many siblings can be evaluated at the same time, and in non-parallel MCTS their action values could overtake each other many times before a particular one is selected again.
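The x/y weighting can be written directly. This is a minimal sketch with hypothetical helper names (`fractional_weight`, `weighted_backup`) and an assumed `Node` with `visits`/`value_sum` fields:

```python
# Sketch of "fractional backup": a node with x current visits that would only
# be (re)selected after y visits contributes its eval with weight x / y,
# approximating its share in a non-parallel search.

class Node:
    def __init__(self):
        self.visits = 0.0      # fractional visits are allowed here
        self.value_sum = 0.0

def fractional_weight(current_visits, visits_until_selected):
    """Weight for the eval of a speculatively evaluated node (x / y)."""
    x, y = current_visits, visits_until_selected
    if y <= 0:
        return 1.0             # node would be selected now: full backup
    return min(1.0, x / y)

def weighted_backup(path, value, weight):
    """Back up a fractional visit/eval along the path."""
    for n in path:
        n.visits += weight
        n.value_sum += weight * value
```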
I don't understand why there is a big problem with high parallel_games. Do you normally see a big strength improvement within an hour? You would just be training new models on games produced by models several generations old, but the games would be of higher quality (and maybe quantity).
TL;DR: I had what I believe to be a novel idea to effectively double virtual losses (the number of positions evaluated simultaneously, i.e. without updating MCTS) by speculatively evaluating a child of the current position inside TF, and a second idea to dynamically select the virtual loss.
Background
Ideally we'd evaluate one position (AKA PV, AKA branch), update MCTS with the new information, and then choose a second position to explore. In the real world this is sub-optimal because it leaves the CPU/GPU/TPU idle; additionally, for neural-net-based engines, GPUs/TPUs have sub-linear scaling, so a larger batch size executes FASTER per position as we add more positions to consider at a time.
The classic solution is, at each time step, to have MCTS select some set of positions (we call the size of this set the "batch size" or "virtual losses" (shortened to vloss) and pass it as a static parameter) and evaluate them in parallel¹.
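The classic scheme can be sketched as follows. This is a hypothetical illustration, not MG or LZ code; the `Node` class and a simplified selection score (real engines use a full PUCT formula) are assumptions:

```python
# Sketch of batched MCTS selection with virtual loss: each selection pass
# temporarily counts an in-flight leaf as a loss, steering later passes in
# the same batch toward different leaves.

class Node:
    def __init__(self, prior=1.0):
        self.children = {}          # move -> Node
        self.visits = 0
        self.value_sum = 0.0        # sum of evals from this node's view
        self.virtual_losses = 0
        self.prior = prior

    def q(self):
        n = self.visits + self.virtual_losses
        # A virtual loss counts as an eval of -1, lowering Q on hot paths.
        return (self.value_sum - self.virtual_losses) / n if n else 0.0

def select_batch(root, batch_size):
    """Select `batch_size` leaves, applying a virtual loss along each path."""
    paths = []
    for _ in range(batch_size):
        node, path = root, [root]
        while node.children:
            # Simplified score; real engines use PUCT (Q + U with priors).
            node = max(node.children.values(), key=lambda c: c.q() + c.prior)
            path.append(node)
        for n in path:              # discourage re-selecting this path
            n.virtual_losses += 1
        paths.append(path)
    return paths

def backup(path, value):
    """Replace the virtual loss with the real eval after NN evaluation."""
    for n in path:
        n.virtual_losses -= 1
        n.visits += 1
        n.value_sum += value
```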
The hope is that the 2nd, 3rd, ..., nth positions selected would have been selected shortly afterwards as the most interesting position (AKA PV), or that they still contribute to the general average of a position (the 2nd most severe move might still be severe enough to consider).
If we imagine sorting all the action scores of all the leaf nodes, we'd get a histogram something like this:
If virtual losses is well tuned (the green line), we explore the best/most-interesting positions plus a couple extra that would likely have been explored soon anyway (if we were using vloss=1). If virtual losses is a little too large (the yellow line), we explore some merely "okay" positions (e.g. the 2nd most interesting moves from a position or branch). If we select far too large a vloss, we get the red line, where we explore many positions that would never have been considered.
It's worth noting that exploring extra positions isn't just a waste of time but may actually decrease the quality of the search: traditional AB engines only consider the best child of a node and benefit from knowing that "playing anything but the capture at D4 is bad". MCTS engines instead calculate a position as the average of all of its children, so adding the information that "most moves under this position are bad" decreases the Q of that position (even if one child is a strong move). Further, vloss limits the max depth of reading to
readouts / v_loss
as MCTS can only select one step deeper in the PV each iteration.

Stuck with this trade-off between faster evaluations and lower search quality, someone introduced another idea: root parallelism. Instead of evaluating many positions from one game, evaluate one position from each of many games (parameter parallel_games). This avoids weakening MCTS while keeping the batch large enough to maximize the throughput of the NN. Both MiniGo (MG) and leela-zero (LZ) use root parallelism to fully saturate the GPU/TPU. Unfortunately, root parallelism makes each individual game take longer to complete, which is a big problem for MG, where we generate a new model multiple times per hour (which is close to the average length of a game when parallel_games is high).
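Root parallelism can be sketched in a few lines. The interfaces here (`select_leaf`, `backup`, `evaluate_batch`) are hypothetical stand-ins for what MG/LZ actually do:

```python
# Sketch of root parallelism: instead of selecting many leaves from one tree,
# take one leaf from each of many games and batch them into a single NN call,
# so no virtual loss is needed within any individual tree.

def root_parallel_step(games, evaluate_batch):
    """games: objects with select_leaf() and backup(leaf, value);
    evaluate_batch: NN function mapping a list of leaves to values."""
    leaves = [g.select_leaf() for g in games]   # one selection per game
    values = evaluate_batch(leaves)             # one big NN batch
    for g, leaf, value in zip(games, leaves, values):
        g.backup(leaf, value)                   # plain, non-parallel backup
```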
All of this rolling around in my head led to two ideas: speculative MCTS and variable-size virtual loss.
Speculative MCTS
Inside the TensorFlow code, additionally evaluate the best child of the current position.
This could be a huge benefit because it effectively doubles the number of positions evaluated per NN call without waiting for an MCTS update between them.
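A conceptual sketch of the speculative step, written as plain Python rather than the actual MiniGo TF graph (`net` and `play_move` are assumed interfaces). The point is that MCTS makes a single call and both plies are evaluated before control returns, mimicking speculation inside the TF graph:

```python
# Conceptual sketch of speculative child evaluation: in one call from MCTS,
# evaluate each selected position AND the child its policy ranks highest,
# effectively doubling the positions evaluated per MCTS round trip.

def evaluate_with_speculation(positions, net, play_move):
    """net(batch) -> list of (policy, value), policy is a move -> prob dict;
    play_move(pos, move) -> child position."""
    results = net(positions)                        # evaluate parents
    children = []
    for pos, (policy, _value) in zip(positions, results):
        best_move = max(policy, key=policy.get)     # policy argmax
        children.append(play_move(pos, best_move))  # speculative child
    child_results = net(children)                   # same-size second batch
    # MCTS receives parent and child evals together, one round trip.
    return list(zip(results, child_results))
```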
Plan
Dynamic virtual losses
In Background I describe why choosing too large a vloss can be bad. Instead of choosing a small fixed number, let MCTS choose the number of positions that have a high action score (e.g. action score >= 0.95 * max(action score)). In less sharp positions (AKA many moves and PVs look good), this allows us to consider many possible moves at the same time. In sharp positions (ladders, capture races), it focuses MCTS on the most important line.
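The threshold rule above is simple to sketch. The 0.95 cutoff matches the example in the text; `dynamic_batch` and the `action_scores` mapping are hypothetical names, and the sketch assumes non-negative scores:

```python
# Sketch of dynamic virtual loss: instead of a fixed batch size, select every
# child whose action score is within a fraction of the best one.

def dynamic_batch(action_scores, threshold=0.95):
    """action_scores: move -> score (assumed non-negative).
    Returns the moves worth evaluating in parallel this step."""
    if not action_scores:
        return []
    best = max(action_scores.values())
    cutoff = threshold * best
    # Sharp positions (one dominant score) yield a small batch;
    # flat positions (many comparable scores) yield a large one.
    return [m for m, s in action_scores.items() if s >= cutoff]
```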
Experiments
As always @amj has provided some quality feedback and advice on how to measure.
Footnotes