shindavid opened this issue 1 year ago
Jotting this down here as a reference. Profiling of python MCTS self-play with 400 rollouts on my desktop:
```
$ ./py/connect4/profiling_experiment.py -g 20 -i 400
...
Avg runtime: 29.456s
Max runtime: 37.145s
Min runtime: 19.452s
```
So far, without using parallel games, the c++ implementation is ~5x faster as measured by avg runtime, mostly thanks to multithreading and batched net evals:
```
$ ./target/Release/bin/c4_mcts_vs_perfect -g 20 -m 400
Avg runtime: 6.137s
Max runtime: 9.518s
Min runtime: 4.438s
```
The current configuration has about 30% GPU utilization. We can crank this up by increasing the batch size and the number of search threads, but without parallel games that will degrade game-play quality. With parallel games we can increase the batch size while keeping the number of search threads constant, which should improve speed without sacrificing game-play quality.
```
$ ./target/Release/bin/c4_self_play
...
Self-play complete!
W530 L341 D129
Parallelism factor: 100
Num games: 1000
MCTS iters: 400
MCTS search threads: 8
MCTS max batch size: 144
MCTS avg batch size: 123.72
Total runtime: 47.844s
Avg runtime: 0.048s
```
The current default configuration on my machine averages a throughput of one self-play game (with 400 iters) every 0.048 seconds. Compared to the python implementation (one game every 29.456s), this is more than 613x faster.
This is still pending my completion of the python->c++ port of the MCTS logic, of course. The port is taking longer than one might expect because I am building in all the parallelization niceties that the python implementation ignored: multiple search threads, a dedicated neural net eval thread that batches requests from the search threads, etc. Doing this in a game-agnostic manner requires some heavy-duty c++ template code.
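To make that pattern concrete, here is a minimal sketch of a dedicated eval thread that batches requests from search threads, assuming a promise/future handoff. The names (EvalRequest, BatchedEvaluator, run_net_batch) are made up for illustration and are not the actual classes in this repo.

```cpp
// Hypothetical sketch: search threads enqueue requests and block on a future;
// a single eval thread drains the queue up to max_batch_size and runs one
// batched net evaluation per wakeup.
#include <algorithm>
#include <condition_variable>
#include <future>
#include <iterator>
#include <mutex>
#include <thread>
#include <vector>

struct EvalRequest {
  std::vector<float> input;                 // encoded game state
  std::promise<std::vector<float>> output;  // policy/value result
};

class BatchedEvaluator {
 public:
  explicit BatchedEvaluator(size_t max_batch_size)
      : max_batch_size_(max_batch_size), worker_([this] { loop(); }) {}

  ~BatchedEvaluator() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      shutdown_ = true;
    }
    cv_.notify_all();
    worker_.join();
  }

  // Called by a search thread; blocks until the eval thread has run the net.
  std::vector<float> evaluate(std::vector<float> input) {
    EvalRequest req;
    req.input = std::move(input);
    std::future<std::vector<float>> fut = req.output.get_future();
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(std::move(req));
    }
    cv_.notify_one();
    return fut.get();
  }

 private:
  void loop() {
    for (;;) {
      std::vector<EvalRequest> batch;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return shutdown_ || !queue_.empty(); });
        if (shutdown_ && queue_.empty()) return;
        size_t n = std::min(max_batch_size_, queue_.size());
        batch.assign(std::make_move_iterator(queue_.begin()),
                     std::make_move_iterator(queue_.begin() + n));
        queue_.erase(queue_.begin(), queue_.begin() + n);
      }
      // One GPU call for the whole batch.
      std::vector<std::vector<float>> outputs = run_net_batch(batch);
      for (size_t i = 0; i < batch.size(); ++i) {
        batch[i].output.set_value(std::move(outputs[i]));
      }
    }
  }

  // Stand-in for the real batched inference backend; returns dummy outputs
  // so the sketch is self-contained.
  std::vector<std::vector<float>> run_net_batch(const std::vector<EvalRequest>& batch) {
    return std::vector<std::vector<float>>(batch.size(), std::vector<float>(8, 0.0f));
  }

  size_t max_batch_size_;
  std::mutex mutex_;
  std::condition_variable cv_;
  std::vector<EvalRequest> queue_;
  bool shutdown_ = false;
  std::thread worker_;  // the dedicated eval thread
};
```

From each search thread's perspective evaluate() looks synchronous, while the GPU sees one large batch per call; that is where the batched-eval speedup comes from.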
Once the c++ MCTS code is working, there will be a number of interrelated options/knobs to play with: MCTS iters per move, number of search threads, max batch size, and the parallelism factor (number of concurrent games).
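As a minimal sketch, these might be grouped into something like the following; the struct and field names are hypothetical (not the repo's actual config), with the numbers from the self-play run above as placeholder defaults.

```cpp
// Hypothetical grouping of the knobs discussed in this thread.
struct MctsConfig {
  int num_iters = 400;           // MCTS iterations per move
  int num_search_threads = 8;    // search threads feeding the eval thread
  int max_batch_size = 144;      // cap on requests per batched net eval
  int parallelism_factor = 100;  // number of self-play games run concurrently
};
```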
To choose these values well, and to decide in a data-driven way whether extra logical complexity is worth incurring, we need to monitor a few things: GPU utilization, average batch size, per-game runtime, and game-play quality.
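Roughly, the per-run quantities already printed above could be collected into something like this (hypothetical, not repo code); GPU utilization itself has to be sampled externally, e.g. with nvidia-smi.

```cpp
// Hypothetical per-run summary mirroring the self-play output above.
struct SelfPlayStats {
  int wins = 0, losses = 0, draws = 0;  // game outcome counts, for tracking game-play quality
  double avg_batch_size = 0.0;          // how full the batched net evals actually are
  double avg_game_runtime_s = 0.0;      // throughput: seconds per self-play game
  double gpu_utilization_pct = 0.0;     // sampled externally, e.g. via nvidia-smi
};
```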
Here is a good link that describes some of the considerations with regard to GPU utilization: https://medium.com/oracledevs/lessons-from-alpha-zero-part-5-performance-optimization-664b38dc509e
The literature even contains lock-free approaches. But MCTS has grown more complex in the years since that paper was written, and I don't think the lock-free approach is viable anymore. Notably, KataGo relies on a complex hodge-podge of std::atomic and std::mutex.
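To illustrate what that kind of hybrid can look like, here is a minimal sketch (my own illustration, not KataGo's actual code): hot per-visit counters live in std::atomic so search threads never block on them, while a std::mutex is reserved for the rarer structural operation of expanding a node.

```cpp
// Hypothetical per-node stats mixing std::atomic and std::mutex.
#include <atomic>
#include <memory>
#include <mutex>
#include <vector>

struct Node {
  // Hot counters touched by every search thread on every visit: atomics, no lock.
  std::atomic<int> visit_count{0};
  std::atomic<int> virtual_loss{0};   // discourages threads from piling onto the same line
  std::atomic<double> value_sum{0.0};

  // Rare structural mutation (expanding children): guarded by a mutex.
  std::mutex expand_mutex;
  std::vector<std::unique_ptr<Node>> children;

  void add_virtual_loss() { virtual_loss.fetch_add(1, std::memory_order_relaxed); }
  void clear_virtual_loss() { virtual_loss.fetch_sub(1, std::memory_order_relaxed); }

  void add_visit(double value) {
    visit_count.fetch_add(1, std::memory_order_relaxed);
    // std::atomic<double> has no fetch_add before C++20, so use a CAS loop.
    double old = value_sum.load(std::memory_order_relaxed);
    while (!value_sum.compare_exchange_weak(old, old + value,
                                            std::memory_order_relaxed)) {
    }
  }

  void expand(int num_children) {
    std::lock_guard<std::mutex> lock(expand_mutex);
    if (!children.empty()) return;  // another thread already expanded this node
    for (int i = 0; i < num_children; ++i) {
      children.push_back(std::make_unique<Node>());
    }
  }
};
```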
Eventually I could imagine an "auto-tuner" process which chooses these values based on automated experimentation. The optimal configuration will surely depend on the hardware being used (the number of GPUs available to the machine being a major consideration).