shindavid / AlphaZeroArcade


Deterministic MCTS #88

Open shindavid opened 10 months ago

shindavid commented 10 months ago

This change rethinks the standard way to parallelize MCTS via virtual loss.

There is now only one search thread that updates the official values/counts in a given MCTS tree. This thread is not a hard worker: it does no child expansion or net evaluations. If it hits a node that currently lacks the required info, it blocks until the info becomes available.

The hard work is relegated to prefetch threads, which zip around the tree, working to pave the way for the search thread. They try to predict where the single search thread will want to look next, by doing PUCT calculations on unofficial values/counts using virtual loss. If they do their job well, the search thread coasts and rarely blocks. If they do it poorly, the search thread frequently blocks, and in the worst case issues a command to the prefetch threads to reset their unofficial stats to the official ones.
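As a rough sketch of the selection rule the prefetch threads might apply (the names and constants here are hypothetical illustrations, not taken from the AlphaZeroArcade code): a PUCT score computed on the unofficial stats, with virtual loss applied to edges that already have in-flight prefetches:

```python
import math

def puct_score(parent_visits, edge_visits, edge_value_sum, prior,
               virtual_loss, c_puct=1.25):
    """PUCT score over unofficial stats, with virtual loss (hypothetical
    sketch; values assumed to lie in [-1, 1])."""
    # Virtual loss: pretend `virtual_loss` extra visits occurred, each
    # scoring the worst possible value. This lowers the edge's score and
    # steers other prefetch threads away from paths already in flight.
    n = edge_visits + virtual_loss
    q = (edge_value_sum - virtual_loss) / n if n > 0 else 0.0
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + n)
    return q + u
```

With `virtual_loss > 0` the score drops below its virtual-loss-free value, which is exactly the diversification effect the prefetch threads rely on.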

This makes it so that parallel-MCTS parameters (num threads, virtual-loss size) affect only MCTS speed, not MCTS behavior.

lightvector commented 10 months ago

You might want to check what the relative playing strength of this formulation is with a well-trained net versus the prior implementation, given equal compute time on equal, reasonable hardware. There's a chance you're paying something nontrivial in exchange for determinism here.

shindavid commented 10 months ago

@lightvector Getting this comment from you is an honor and represents a milestone for the project. :)

If I am indeed paying something nontrivial, do you have any suspicions on the nature of that payment? For example, here are some distinct non-overlapping (and exhaustive??) possibilities:

  1. Nondeterminism in the traditional implementation accidentally finds good moves a nontrivial percentage of the time.
  2. The reset mechanic of this implementation is needed often enough to significantly increase the overall runtime.
  3. The memory increase of this implementation meaningfully reduces the permissible tree size.
  4. In this implementation, the work that the single search thread performs bottlenecks the system in a way that it didn't when it was split across multiple threads.

I will try my best to test on the limited set of games I currently support, but it would be hard to extrapolate from whatever data I obtain to more complex games and mature implementations like go/KataGo. If you have intuitions on this, it might help me focus my tests (or push me to add more games...maybe even go).

I'll also add that my objectives might slightly differ from yours. I'm hoping to create a general system that minimizes the need for game-specific parameterization. If the standard nondeterministic approach demands a game-specific alignment of {num-virtual-threads, num-mcts-iterations, virtual-loss-size}, while my approach is more resilient to those parameter choices across a variety of games, I would consider that a win, even if my approach shows no clear gain in strength relative to an optimally tuned traditional approach. (If there is a clear loss in strength, that's a different story.)

Well, in my system, to the extent that those choices affect search speed, and to the extent that search speed dictates playing-strength-per-compute, the choices may meaningfully affect strength. But, search speed is easier to optimize in an automated fashion than playing-strength, which may justify the formulation change.

lightvector commented 9 months ago

I'm not sure, but I think 1 might be a slight contribution. I think 2 might be a mildly bigger contribution. I think 3 and 4 are unlikely to be a problem.

You forgot 5: the multithreaded search, due to butterfly effects and greater exploration, just happens to search a reasonable fraction of nodes that the 1-threaded search doesn't. You're effectively building two different MCTS trees in different ways. Even if all the nodes would eventually get explored by both the same way, at any fixed stopping point you choose, those two trees are not going to overlap perfectly, and all the non-overlapping nodes are wasted compute. I would guess that a lot of the time, the extra nodes from a multithreaded search are not actively harmful, they're just less helpful than a real "non-virtual-loss" node. Suppose you have, e.g., 25% non-overlap in the search tree, and suppose the extra nodes are 60% as "helpful", so that the 25% excess is as good as 15% of real nodes would be on average. In that case you would have gotten a global performance loss of roughly 10% (you paid for 25% of the tree's nodes but got value worth only 15%).

I think effect 5, if it's real, might be bigger than any of the effects 1-4 you mentioned.

lightvector commented 9 months ago

Although I guess it depends on your accounting and your reset mechanism and exactly how it's implemented. For example if your resets are relatively aggressive, then it might be "less" of effect 5 and "more" of effect 2 and then effect 2 might be more dominant instead.

Edit:

> I would guess that a lot of the time, the extra nodes from a multithreaded search are not actively harmful,

Incidentally, this is why I'm pretty sure that, in the "just don't increment the edge visits" version of discounting playouts driven by virtual loss, you only want to discount playouts that are sufficiently bad by some threshold. You only want to disregard playouts that would be actively harmful to include in the tree; as long as a playout would have positive usefulness to the search, even if virtual loss made it not the optimal playout, you want to keep it.
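The thresholded version of "just don't increment the edge visits" might look like the following sketch (the function name and the threshold value are hypothetical, chosen only to illustrate the rule):

```python
def should_count_playout(edge_q, playout_value, threshold=0.1):
    """Decide whether a virtual-loss-driven playout should increment edge
    visits (hypothetical sketch; values assumed to lie in [-1, 1])."""
    # Keep the playout unless it is "actively harmful": i.e., its value
    # falls short of the edge's current Q by more than `threshold`.
    # Mildly suboptimal playouts still count toward edge visits.
    return playout_value >= edge_q - threshold
```

A playout slightly below the edge's Q still counts; only one clearly below the threshold is discarded.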

shindavid commented 9 months ago

Thanks, you are absolutely right. I realized my omission of effect 5 pretty soon after starting some initial experimentation. I think it, and its interplay with effect 2, will prove to be the crux of whether this idea turns out to be worthwhile.

I realized that it may be possible to combine this determinism idea with your "conditionally increment edge counts" idea. The prefetch threads are free to do whatever they want to best achieve their goal of assisting the master search thread, and that can include conditionally incrementing edge counts.

I also have in mind trying a variant of Leela Chess Zero's alternative to virtual loss, replacing the assumption of unchanged Q with a prediction from the aux value head.
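A rough sketch of the contrast (hypothetical names; values assumed to lie in [-1, 1]): classic virtual loss scores each pending visit as an outright loss, the Lc0-style scheme assumes each pending evaluation will return the edge's current Q, and the variant mentioned here would substitute a prediction from an auxiliary value head:

```python
def q_with_pending(edge_visits, edge_value_sum, pending, predicted_value):
    """Effective Q used for selection while `pending` evaluations are in
    flight (hypothetical sketch)."""
    # predicted_value = -1 recovers classic virtual loss;
    # predicted_value = edge_value_sum / edge_visits recovers the Lc0-style
    # "unchanged Q" assumption; an aux value head's output would typically
    # land somewhere in between.
    n = edge_visits + pending
    return (edge_value_sum + pending * predicted_value) / n
```

Plugging in the edge's current Q as `predicted_value` leaves Q unchanged, while more pessimistic predictions drag it down toward the virtual-loss behavior.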

With such improvements, the prefetch threads may get closer and closer to matching the master search thread (as far as maintaining edge counts goes). As a result, the overhead of this proposal (from effects 2 and 5) should shrink. At the same time, the more the prefetch threads match the master search thread, the less likely that this proposal adds any value at all. Perhaps the only relevant use of the proposal might be to provide a well-defined reference baseline against which to measure candidate nondeterministic approaches/parameterizations.