Closed sethtroisi closed 5 years ago
Nice analysis and summary. But I am confused why anyone would use batch size and virtual loss to refer to the same thing; to me, virtual loss is a certain modification applied to the action values (Q or U, or both) of the nodes along the path to a leaf node that is being evaluated (to discourage the search from descending into the same leaf node), and the size of the virtual loss usually determines the magnitude of that modification.
For batching with LZ in matches (with very high playouts), as opposed to self-play, I use a vloss accumulation strategy to avoid these evaluating nodes; see my (alreadydone/lz) branches with the word accum in their names. The idea is to keep the virtual loss along the path if you hit the same leaf node again and again. However, this is CPU-intensive, so I am leaning toward combining it with an earlier, also natural approach: expand the next sibling of the leaf node once the leaf has accumulated a certain number of vlosses. I've thought about keeping a concurrent priority queue keyed on action value, but the action values would ideally need to be updated after each backup, and many intermediate nodes' action values also have to be taken into account (the search chooses one child at each of those nodes), yet there are no natural probabilities derived from action values to take the product of. I also haven't thought through how to avoid excessive thread contention.
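The accumulation idea above can be sketched roughly as follows. This is a hypothetical illustration, not the code from the accum branches; the `Node` class, `handle_leaf` helper, and `ACCUM_THRESHOLD` constant are all assumed names:

```python
# Hypothetical sketch of "vloss accumulation": if selection lands on a leaf
# that is already being evaluated, leave the virtual loss on the path (the
# leaf is not queued again); once the leaf has accumulated enough virtual
# losses, fall back to expanding its next-best sibling instead.

class Node:
    def __init__(self, prior=0.0):
        self.children = {}        # move -> Node
        self.prior = prior
        self.virtual_losses = 0

ACCUM_THRESHOLD = 4  # assumed tuning parameter

def handle_leaf(parent, leaf, in_flight):
    """Return the node to actually queue for NN evaluation (or None)."""
    if leaf not in in_flight:
        in_flight.add(leaf)
        return leaf                    # normal case: queue this leaf
    leaf.virtual_losses += 1           # accumulate; vloss stays on the path
    if leaf.virtual_losses >= ACCUM_THRESHOLD and parent is not None:
        # Leaf is saturated: queue the best sibling not yet in flight.
        siblings = sorted(parent.children.values(),
                          key=lambda c: c.prior, reverse=True)
        for sib in siblings:
            if sib is not leaf and sib not in in_flight:
                in_flight.add(sib)
                return sib
    return None                        # nothing new to evaluate this pass
```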
To avoid a node that shouldn't (at the moment) be selected affecting its ancestors' evals, I introduced a "fractional backup" strategy: based on current winrates, calculate how many more visits are needed before the node would be selected, and weight the node's eval so that the proportion of its contribution is the same as at the moment it would be selected in non-parallel MCTS (i.e. if the node currently has x visits and would be selected after y visits, weight its eval by x/y). This is only a rough approximation, especially at large batch sizes: many siblings can be evaluated at the same time, and in non-parallel MCTS their action values could overtake each other many times before a particular one is selected again.
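The x/y weighting can be written directly. This is a minimal sketch with hypothetical helper names (`fractional_weight`, `weighted_backup`) and an assumed `Node` with `visits`/`value_sum` fields:

```python
# Sketch of "fractional backup": a node with x current visits that would only
# be (re)selected after y visits contributes its eval with weight x / y,
# approximating its share in a non-parallel search.

class Node:
    def __init__(self):
        self.visits = 0.0      # fractional visits are allowed here
        self.value_sum = 0.0

def fractional_weight(current_visits, visits_until_selected):
    """Weight for the eval of a speculatively evaluated node (x / y)."""
    x, y = current_visits, visits_until_selected
    if y <= 0:
        return 1.0             # node would be selected now: full backup
    return min(1.0, x / y)

def weighted_backup(path, value, weight):
    """Back up a fractional visit/eval along the path."""
    for n in path:
        n.visits += weight
        n.value_sum += weight * value
```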
I don't understand why there is a big problem with high parallel_games. Do you normally see a big strength improvement within an hour? You would just be training new models on games produced by models several generations old, but the games would be of higher quality (and maybe quantity).
TL;DR: I had what I believe to be a novel idea to effectively double virtual losses (the number of positions evaluated simultaneously, i.e. without updating MCTS) by speculatively evaluating a child of the current position inside TF, and a second idea to dynamically select the virtual loss.
Background
Ideally we'd evaluate one position (AKA PV, AKA branch), update MCTS with the new information, and then choose a second position to explore. In the real world this is sub-optimal because it leaves the CPU/GPU/TPU idle; additionally, for neural-net-based engines, GPUs/TPUs have sub-linear scaling, so a larger batch size executes FASTER per position as we add more positions to consider at a time.
The classic solution is, at each time step, to have MCTS select some set of positions (we call the size of this set the "batch size" or "virtual losses" (shortened to vloss) and pass it as a static parameter) and evaluate them in parallel¹.
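The classic scheme can be sketched as follows. This is a hypothetical illustration, not MG or LZ code; the `Node` class and a simplified selection score (real engines use a full PUCT formula) are assumptions:

```python
# Sketch of batched MCTS selection with virtual loss: each selection pass
# temporarily counts an in-flight leaf as a loss, steering later passes in
# the same batch toward different leaves.

class Node:
    def __init__(self, prior=1.0):
        self.children = {}          # move -> Node
        self.visits = 0
        self.value_sum = 0.0        # sum of evals from this node's view
        self.virtual_losses = 0
        self.prior = prior

    def q(self):
        n = self.visits + self.virtual_losses
        # A virtual loss counts as an eval of -1, lowering Q on hot paths.
        return (self.value_sum - self.virtual_losses) / n if n else 0.0

def select_batch(root, batch_size):
    """Select `batch_size` leaves, applying a virtual loss along each path."""
    paths = []
    for _ in range(batch_size):
        node, path = root, [root]
        while node.children:
            # Simplified score; real engines use PUCT (Q + U with priors).
            node = max(node.children.values(), key=lambda c: c.q() + c.prior)
            path.append(node)
        for n in path:              # discourage re-selecting this path
            n.virtual_losses += 1
        paths.append(path)
    return paths

def backup(path, value):
    """Replace the virtual loss with the real eval after NN evaluation."""
    for n in path:
        n.virtual_losses -= 1
        n.visits += 1
        n.value_sum += value
```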
The hope is that the 2nd, 3rd, ..., nth positions selected would have been selected shortly afterwards as the most interesting position (AKA PV), or that they still contribute to the general average of a position (the 2nd most severe move might still be severe enough to consider).
If we imagine sorting all the action scores of all the leaf nodes, we'd get a histogram something like this:
If virtual losses is well tuned (the green line), we explore the best/most-interesting positions plus a couple extra that would likely have been explored soon anyway (if we were using vloss=1). If virtual losses is a little too large (the yellow line), we explore some merely "okay" positions (e.g. the 2nd most interesting moves from a position or branch). If we select far too large a vloss, we get the red line, where we explore many positions that would never have been considered.
It's worth noting that exploring extra positions isn't just a waste of time but may actually decrease the quality of the search: traditional AB engines only consider the best child of a node and benefit from knowing that "playing anything but the capture at D4 is bad". MCTS engines instead calculate a position as the average of all of its children, so adding the information that "most moves under this position are bad" decreases the Q of that position (even if one child is a strong move). Further, vloss limits the max depth of reading to
readouts / v_loss
as MCTS can only select one step deeper in the PV each iteration.

Stuck with this trade-off between faster evaluations and lower search quality, someone introduced another idea: root parallelism. Instead of evaluating many positions from one game, evaluate one position from each of many games (parameter parallel_games). This avoids weakening MCTS while keeping the batch large enough to maximize the throughput of the NN. Both MiniGo (MG) and leela-zero (LZ) use root parallelism to fully saturate the GPU/TPU. Unfortunately, root parallelism makes each individual game take longer to complete, which is a big problem for MG, where we generate a new model multiple times per hour (which is close to the average length of a game when parallel_games is high).
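Root parallelism can be sketched in a few lines. The interfaces here (`select_leaf`, `backup`, `evaluate_batch`) are hypothetical stand-ins for what MG/LZ actually do:

```python
# Sketch of root parallelism: instead of selecting many leaves from one tree,
# take one leaf from each of many games and batch them into a single NN call,
# so no virtual loss is needed within any individual tree.

def root_parallel_step(games, evaluate_batch):
    """games: objects with select_leaf() and backup(leaf, value);
    evaluate_batch: NN function mapping a list of leaves to values."""
    leaves = [g.select_leaf() for g in games]   # one selection per game
    values = evaluate_batch(leaves)             # one big NN batch
    for g, leaf, value in zip(games, leaves, values):
        g.backup(leaf, value)                   # plain, non-parallel backup
```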
All of this rolling around in my head led to two ideas: speculative MCTS and variable-size virtual loss.
Speculative MCTS
Inside the TensorFlow code, additionally evaluate the best child of the current position.
This could be a huge benefit because it effectively doubles the number of positions evaluated per NN call without waiting for an MCTS update between them.
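A conceptual sketch of the speculative step, written as plain Python rather than the actual MiniGo TF graph (`net` and `play_move` are assumed interfaces). The point is that MCTS makes a single call and both plies are evaluated before control returns, mimicking speculation inside the TF graph:

```python
# Conceptual sketch of speculative child evaluation: in one call from MCTS,
# evaluate each selected position AND the child its policy ranks highest,
# effectively doubling the positions evaluated per MCTS round trip.

def evaluate_with_speculation(positions, net, play_move):
    """net(batch) -> list of (policy, value), policy is a move -> prob dict;
    play_move(pos, move) -> child position."""
    results = net(positions)                        # evaluate parents
    children = []
    for pos, (policy, _value) in zip(positions, results):
        best_move = max(policy, key=policy.get)     # policy argmax
        children.append(play_move(pos, best_move))  # speculative child
    child_results = net(children)                   # same-size second batch
    # MCTS receives parent and child evals together, one round trip.
    return list(zip(results, child_results))
```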
Plan
Dynamic virtual losses
In Background I describe why choosing too large a vloss can be bad. Instead of choosing a small fixed number, let MCTS choose the number of positions that have a high action score (e.g. action score >= 0.95 * max(action score)). In less sharp positions (AKA many moves and PVs look good), this allows us to consider many possible moves at the same time. In sharp positions (ladders, capture races), it focuses MCTS on the most important line.
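The threshold rule above is simple to sketch. The 0.95 cutoff matches the example in the text; `dynamic_batch` and the `action_scores` mapping are hypothetical names, and the sketch assumes non-negative scores:

```python
# Sketch of dynamic virtual loss: instead of a fixed batch size, select every
# child whose action score is within a fraction of the best one.

def dynamic_batch(action_scores, threshold=0.95):
    """action_scores: move -> score (assumed non-negative).
    Returns the moves worth evaluating in parallel this step."""
    if not action_scores:
        return []
    best = max(action_scores.values())
    cutoff = threshold * best
    # Sharp positions (one dominant score) yield a small batch;
    # flat positions (many comparable scores) yield a large one.
    return [m for m, s in action_scores.items() if s >= cutoff]
```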
Experiments
As always @amj has provided some quality feedback and advice on how to measure.
Footnotes