shindavid / AlphaZeroArcade


Various improvements #90

Closed by shindavid 2 months ago

shindavid commented 9 months ago

I've learned many things from discussions on various Discord channels and from conversations with the author of the very similar kZero project. Here is a list of some of those things:

  1. Single-threaded MCTS during training self-play: There is no need to use multi-threaded MCTS during training self-play, since you can saturate the GPU by increasing the number of parallel games instead. This should improve the quality of the self-play games (multi-threaded search relies on virtual loss, which distorts the search).
  2. Do not store game state/tensorizor on each node: All you need is the root game state. Each search thread can copy the state and then modify it as it traverses the tree; it never needs to be copied onto a node.
  3. Copy the policy probability to the edge: This beats storing the entire policy at each node (although it requires fully expanding all edges when a node is expanded, which increases RAM usage).
  4. Node indirection: Before I added MCGS, each parent stored just a Node pointer and an int to represent the child array. After MCGS, this was changed to a list of edges, each of which has a Node pointer. We can actually still use the more compact (pointer, int) representation; just give each Node the option to redirect to an "original" Node. This may make it easier to switch to object pools. (Not sure if there are some difficult details here with tree-reuse and smart-pointers...)
  5. Continuous self-play process: Currently, AlphaZeroArcade stops the self-play process and restarts it whenever the network weights are updated. Instead, keep the process running forever, and just have it reload network weights from disk when they become available. The reload can happen in the middle of a game, so the beginning of a game may use network1 and the end network2. Nothing wrong with that!
  6. Training window management logic: kZero initializes a window with games produced by a dummy uniform net. Then each position in this batch is artificially marked as having been sampled T times. As games are added to the dataset, the window shifts. *If the average number of times the window's rows were sampled drops below T, the window is used for training.* I currently lack the italicized part, which makes my version much less smooth. (See the sketch below.)
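
To make the italicized condition in item 6 concrete, here is a minimal sketch of that window-management rule. The class and field names (`SampleWindow`, `Row`, `times_sampled`) are hypothetical, not kZero's actual code:

```cpp
// Minimal sketch of the sampling-rate rule above. SampleWindow, Row, and
// target_sample_rate_ are hypothetical names, not kZero's or
// AlphaZeroArcade's actual identifiers.
#include <cstddef>
#include <deque>
#include <utility>

struct Row {
  // ... position tensor, policy target, value target ...
  int times_sampled = 0;
};

class SampleWindow {
 public:
  explicit SampleWindow(double target_sample_rate)
      : target_sample_rate_(target_sample_rate) {}

  // Seed with games from a dummy uniform net, artificially marked as
  // having already been sampled T times each.
  void seed(std::deque<Row> initial_rows) {
    for (Row& row : initial_rows) row.times_sampled = static_cast<int>(target_sample_rate_);
    rows_ = std::move(initial_rows);
  }

  // As self-play produces rows, the window shifts: new rows in, old rows out.
  void add(Row row, std::size_t max_window_size) {
    rows_.push_back(std::move(row));
    while (rows_.size() > max_window_size) rows_.pop_front();
  }

  // Train only when the average number of times the window's rows have been
  // sampled drops below the target rate T.
  bool ready_for_training() const {
    if (rows_.empty()) return false;
    double total = 0;
    for (const Row& row : rows_) total += row.times_sampled;
    return total / rows_.size() < target_sample_rate_;
  }

 private:
  double target_sample_rate_;  // T
  std::deque<Row> rows_;
};
```
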
KarelPeeters commented 9 months ago

Some notes on the first couple of changes:

  1. Some amount of virtual loss (not multithreading) is still useful to lower loop latency (NN -> selfplay -> data -> NN); see the sketch after this list.
  2. Note that this is a CPU/RAM tradeoff. I think it's worth it just for the ability to do huge searches, but this should probably be profiled at some point.
  3. I think a third option is even better: have nodes point to some external arena-allocated policy vecs. This allows lazily allocating child nodes that might never be visited (KataGo does the latter, not sure how they store their policies).
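
On point 1, here is a minimal sketch of how virtual loss is commonly applied so that a single search thread can gather several leaves into one NN batch before backing anything up. Field names are hypothetical, not kZero's or AlphaZeroArcade's actual code:

```cpp
// Minimal sketch of virtual loss for single-threaded, batched leaf selection.
// Field names are hypothetical; values are assumed to be in [-1, 1].
struct Node {
  int visit_count = 0;
  int virtual_loss = 0;    // selections still awaiting NN evaluation
  double value_sum = 0.0;  // sum of backed-up values

  // Q with pending selections counted as losses (-1), steering the next
  // selection in the same batch toward other branches of the tree.
  double q() const {
    int n = visit_count + virtual_loss;
    return n == 0 ? 0.0 : (value_sum - virtual_loss) / n;
  }
};

// Selection: increment virtual_loss on every node along the chosen path,
// then queue the leaf for evaluation and select the next leaf.
// After the batched NN call returns, for each node on each path:
//   --virtual_loss; ++visit_count; value_sum += leaf_value;
```
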
shindavid commented 9 months ago

On arena-allocated policy-vecs: do you envision these to be variable-sized, or of a fixed size big enough to handle the max-branching-factor states of the game?

I currently store them local to the node, using a fixed size. This is undesirable because of wasted space, and also because it requires game-specific configuration of a branching-factor upper bound. But it does allow for lazy child node allocation, which is a form of space saving.

Variable-sized vecs can easily be achieved by just using new float[] and delete[], relying on malloc to do the memory-management work. With more work we could make sure all those floats come from one big memory pool that we allocate up front, but it's not obvious that this would be much of a win or worth the effort.
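
For concreteness, a minimal sketch of that variable-sized option, with a hypothetical `PolicyVec` owning a heap array sized to the legal-move count:

```cpp
// Minimal sketch of the variable-sized option: the policy array is sized to
// the legal-move count and freed in the destructor, letting malloc do the
// memory management. PolicyVec is a hypothetical name.
struct PolicyVec {
  float* data = nullptr;
  int size = 0;

  PolicyVec() = default;
  PolicyVec(const PolicyVec&) = delete;             // avoid double-free
  PolicyVec& operator=(const PolicyVec&) = delete;
  ~PolicyVec() { delete[] data; }

  void init(const float* priors, int n) {
    data = new float[n];
    size = n;
    for (int i = 0; i < n; ++i) data[i] = priors[i];
  }
};
```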

KarelPeeters commented 9 months ago

I was definitely thinking of making them variable-sized; one of the major reasons for doing this in the first place is to save memory.

To clarify, there are two major ways to store policy:

  * store each move's prior on its child node/edge, or
  * store the full policy as a single array of floats in the parent, indexed by move.

The first way sadly doesn't work well with lazy child initialization, so I don't think this is the best long-term solution.

For the second approach we can separately decide where to store this array:

  * inline in the node, with a fixed game-specific maximum size,
  * heap-allocated per node (new float[] / malloc), or
  * in an external arena / object pool allocated up front.

I think the right solution is an array of floats in the parent and either of the last two options for where to store it. We can start with malloc and see if it's good enough.
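
As an illustration of that approach, here is a minimal sketch assuming a hypothetical Node layout: the parent owns a new[]'d policy array, and child Nodes are allocated lazily on first visit:

```cpp
// Hypothetical node layout (not AlphaZeroArcade's actual one): the parent
// owns the full policy array, and edge i's prior is simply policy[i].
// Child Nodes are only allocated on their first visit.
#include <algorithm>
#include <memory>
#include <vector>

struct Node {
  std::unique_ptr<float[]> policy;           // priors over legal moves
  std::vector<std::unique_ptr<Node>> child;  // null until first visited
  int num_moves = 0;

  void expand(const float* priors, int n) {
    num_moves = n;
    policy = std::make_unique<float[]>(n);
    std::copy(priors, priors + n, policy.get());
    child.resize(n);  // all entries null: no child Nodes allocated yet
  }

  Node* get_or_create_child(int i) {
    if (!child[i]) child[i] = std::make_unique<Node>();
    return child[i].get();
  }
};
```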

shindavid commented 9 months ago

If using an object pool, we will similarly need to free the memory.

Unless...you are thinking of going with the approach of just doing a big clear of the entire pool in between searches, like kZero currently does for nodes. I feel like this approach won't work with tree-reuse/MCGS.

KarelPeeters commented 9 months ago

Yeah, that's a good point: all of the non-malloc solutions become quite a bit more complicated with MCGS. Tree reuse by itself isn't a big deal; a single pass over the tree is enough to compact everything again.

I just remember being frustrated by running a long search (~1M nodes), extracting the result, and then having to wait a couple of seconds at the end just to free everything. That's not a big deal yet though, and certainly not the first priority to optimize!

shindavid commented 9 months ago

I experienced the same frustration, and so I used to have a separate cleanup thread responsible for freeing nodes. When deleting a node, the main thread would just put it on a cleanup queue that the cleanup thread processed asynchronously. But I removed this at some point, and I don't remember exactly why. Maybe there was a bad interaction with MCGS? In any case, I think this should be doable in principle.
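
For reference, a minimal sketch of that deferred-deletion scheme, assuming a hypothetical CleanupQueue and a Node that owns its children (this is illustrative, not the removed implementation):

```cpp
// Hypothetical sketch of a background cleanup thread: instead of freeing a
// large subtree on the hot path, the main thread hands ownership to a queue
// and a worker destroys it asynchronously.
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Node {
  std::vector<std::unique_ptr<Node>> children;  // destroying a Node frees its subtree
};

class CleanupQueue {
 public:
  CleanupQueue() : worker_([this] { run(); }) {}

  ~CleanupQueue() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  // Called by the main thread instead of deleting the subtree directly.
  void discard(std::unique_ptr<Node> subtree) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(std::move(subtree));
    }
    cv_.notify_one();
  }

 private:
  void run() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (true) {
      cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
      while (!queue_.empty()) {
        std::unique_ptr<Node> subtree = std::move(queue_.front());
        queue_.pop();
        lock.unlock();
        subtree.reset();  // recursive destruction happens off the main thread
        lock.lock();
      }
      if (done_) return;
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::unique_ptr<Node>> queue_;
  bool done_ = false;
  std::thread worker_;  // declared last so it starts after the other members
};
```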

shindavid commented 2 months ago

All items listed here are finally completed!

Some notes: