shindavid / AlphaZeroArcade


MCTS profiling #11

Open shindavid opened 1 year ago

shindavid commented 1 year ago

This is still pending my completion of the python → c++ port of the MCTS logic, of course. It is taking longer than one might expect because I am building in all the parallelization niceties that the python implementation ignored: multiple search threads, a dedicated neural-net eval thread that batches requests from the search threads, etc. Doing this in a game-agnostic manner requires some heavy-duty c++ template code.
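For illustration, the batching pattern can be sketched in a few lines of Python (the names `EvalServer`, `net_fn`, `max_batch`, and `timeout` are invented for this sketch, not the repo's actual API; the real implementation is the c++ one described above):

```python
import queue
import threading

class EvalServer:
    """Sketch: one dedicated thread collects eval requests from many
    search threads, runs them as a single batch, and hands each caller
    its result back through a per-request event."""

    def __init__(self, net_fn, max_batch=8, timeout=0.01):
        self.net_fn = net_fn       # batched eval: list[state] -> list[result]
        self.max_batch = max_batch
        self.timeout = timeout     # max wait before flushing a partial batch
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def evaluate(self, state):
        """Called by a search thread; blocks until the batched result arrives."""
        done = threading.Event()
        slot = [state, None, done]   # [input, output, completion-signal]
        self.requests.put(slot)
        done.wait()
        return slot[1]

    def _loop(self):
        while True:
            batch = [self.requests.get()]   # block for the first request
            try:
                while len(batch) < self.max_batch:
                    batch.append(self.requests.get(timeout=self.timeout))
            except queue.Empty:
                pass                         # timed out: flush a partial batch
            results = self.net_fn([slot[0] for slot in batch])
            for slot, result in zip(batch, results):
                slot[1] = result
                slot[2].set()
```

The timeout is the interesting knob: it trades GPU batch occupancy against the latency each search thread pays waiting for stragglers to join the batch.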

Once the c++ MCTS code is working, there will be a number of interrelated options/knobs to play with:

To make good configuration decisions and to make data-driven decisions on whether to incur logical complexity, we need to monitor a few things:

Here is a good link describing some of the considerations with regard to GPU utilization: https://medium.com/oracledevs/lessons-from-alpha-zero-part-5-performance-optimization-664b38dc509e

The literature even contains lock-free approaches. But MCTS has grown more complex in the years since that work was published, and I don't think the lock-free approach is viable anymore. Notably, KataGo relies on a complex hodgepodge of std::atomic and std::mutex.
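One standard ingredient in these multi-threaded schemes (lock-free or otherwise) is "virtual loss": a search thread provisionally records its in-flight visit as a loss so other threads are steered away from the same path until the real eval comes back. A minimal sketch, with an invented `Node` layout that is not the repo's actual data structure:

```python
import threading

class Node:
    """Illustrative MCTS node; only the stats that virtual loss touches."""
    def __init__(self):
        self.lock = threading.Lock()
        self.visits = 0
        self.value_sum = 0.0

VIRTUAL_LOSS = 1.0

def apply_virtual_loss(node):
    # Pretend the in-flight visit already happened and lost, so concurrent
    # search threads see a worse Q and explore elsewhere.
    with node.lock:
        node.visits += 1
        node.value_sum -= VIRTUAL_LOSS

def revert_virtual_loss(node, value):
    # Once the real eval arrives, replace the provisional loss with the
    # actual backed-up value (the visit itself stays counted).
    with node.lock:
        node.value_sum += VIRTUAL_LOSS + value

def q_value(node):
    with node.lock:
        return node.value_sum / node.visits if node.visits else 0.0
```

A lock-free variant would make `visits` and `value_sum` atomics instead of guarding them with a mutex; the per-node mutex shown here is the simpler (KataGo-style mixed) approach.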

Eventually I could imagine an "auto-tuner" process that chooses values based on automated experimentation. The optimal configuration will surely depend on the hardware being used (the number of GPUs available to the machine being a major consideration).
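At its simplest, such an auto-tuner could just sweep a grid of knob settings and keep the fastest one. A hypothetical sketch (the knob names and the `run_benchmark` callback are made up for illustration):

```python
import itertools

def auto_tune(run_benchmark, grid):
    """Try every combination of knob values and return the best one.

    run_benchmark(config) -> measured seconds (lower is better)
    grid: dict mapping knob name -> list of candidate values
    """
    best_config, best_time = None, float("inf")
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        t = run_benchmark(config)
        if t < best_time:
            best_config, best_time = config, t
    return best_config, best_time

# Hypothetical usage: benchmark a short self-play run per configuration.
# grid = {"search_threads": [4, 8, 16], "batch_size": [64, 144, 256]}
# best, t = auto_tune(run_self_play_benchmark, grid)
```

Exhaustive search explodes combinatorially, so a real tuner would likely move to something smarter (random search, successive halving), but the interface stays the same.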

shindavid commented 1 year ago

Jotting this here for reference: profiling of python MCTS self-play with 400 rollouts on my desktop:

$ ./py/connect4/profiling_experiment.py -g 20 -i 400
...
Avg runtime: 29.456s
Max runtime: 37.145s
Min runtime: 19.452s

shindavid commented 1 year ago

So far, without using parallel games, c++ performance is ~5x faster as measured by avg runtime, mostly thanks to multithreading and batched net evals:

$ ./target/Release/bin/c4_mcts_vs_perfect -g 20 -m 400
Avg runtime: 6.137s
Max runtime: 9.518s
Min runtime: 4.438s

The current configuration sits at about 30% GPU utilization. We can crank this up by increasing the batch size and the number of search threads, but without parallel games that will degrade game-play quality. With parallel games we can increase the batch size while holding the number of search threads constant, which should improve speed without sacrificing game-play quality.
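Back-of-the-envelope reasoning for why parallel games help: with G concurrent games each running T search threads, up to G×T evals can be in flight at once, so the useful batch size grows with G while T (the thing that hurts per-game search quality) stays fixed. A hypothetical calculation:

```python
def max_in_flight(parallel_games, search_threads_per_game):
    # Upper bound on eval requests pending at once, and hence on the
    # useful neural-net batch size.
    return parallel_games * search_threads_per_game

# Single game, 8 search threads: batch size is capped at 8.
assert max_in_flight(1, 8) == 8

# 100 parallel games x 8 threads: up to 800 requests in flight, so a
# max batch size of 144 (as in the self-play run later in this thread)
# can actually fill up without adding threads per game.
assert max_in_flight(100, 8) == 800
```

This is an upper bound, not a prediction: threads spend time in tree traversal too, which is why the observed avg batch size (123.72) sits below the configured max (144).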

shindavid commented 1 year ago

$ ./target/Release/bin/c4_self_play
...
Self-play complete!
W530 L341 D129
Parallelism factor:     100
Num games:             1000
MCTS iters:             400
MCTS search threads:      8
MCTS max batch size:    144
MCTS avg batch size: 123.72
Total runtime:      47.844s
Avg runtime:         0.048s

Current default configuration on my machine averages a throughput of 1 self-play game (with 400 iters) every 0.048 seconds.

Compared to the python implementation (1 game every 29.456s), this is more than 613x faster.
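Sanity-checking that figure from the two quoted averages:

```python
py_avg = 29.456    # python avg runtime per game, seconds
cpp_avg = 0.048    # c++ self-play avg runtime per game, seconds

speedup = py_avg / cpp_avg
assert speedup > 613          # matches "more than 613x faster"
print(f"{speedup:.1f}x")      # → 613.7x
```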