shindavid / AlphaZeroArcade

8 stars 1 forks source link

MCTS profiling #11

Open shindavid opened 1 year ago

shindavid commented 1 year ago

This is still pending my completion of the python->c++ port of the MCTS logic, of course. This is taking me longer than one might expect because I am building in all the parallelization niceties that the python implementation ignored: multiple search threads, a dedicated neural net eval thread which batches requests from the search threads, etc. Doing this in a game-agnostic manner requires some heavy duty c++ template code.

Once the c++ MCTS code is working, there will be a number of interrelated options/knobs to play with:

To make good configuration decisions and to make data-driven decisions on whether to incur logical complexity, we need to monitor a few things:

Here is a good link that describes some of the considerations with regards to GPU utilization: https://medium.com/oracledevs/lessons-from-alpha-zero-part-5-performance-optimization-664b38dc509e

The literature even contains lock-free approaches. But MCTS has grown more complex in the years since that paper was written, and I don't think the lock-free approach is viable anymore. Notably, KataGo relies a complex hodge-podge of std::atomic and std::mutex.

Eventually I could imagine an "auto-tuner" process which chooses values based on automated experimentation. Optimal configuration will surely be dependent on the hardware being used (number of GPU's available to the machine being a major consideration).

shindavid commented 1 year ago

Jotting as a reference here, profiling of python MCTS self-play with 400 rollouts on my desktop:

$ ./py/connect4/profiling_experiment.py -g 20 -i 400
...
Avg runtime: 29.456s
Max runtime: 37.145s
Min runtime: 19.452s
shindavid commented 1 year ago

So far, without using parallel games, c++ performance is ~5x faster measured by avg runtime, mostly thanks to multithreading and batched net evals:

$ ./target/Release/bin/c4_mcts_vs_perfect -g 20 -m 400
Avg runtime: 6.137s
Max runtime: 9.518s
Min runtime: 4.438s

The current configuration has about 30% GPU utilization. We can crank this up by increasing the batch size and num search-threads, but without parallel games, that will lead to game-play-quality deterioration. With parallel games we can increase batch-size while keeping num search-threads constant, and that should improve speed without sacrificing game-play-quality.

shindavid commented 1 year ago
$ ./target/Release/bin/c4_self_play
...
Self-play complete!
W530 L341 D129
Parallelism factor:     100
Num games:             1000
MCTS iters:             400
MCTS search threads:      8
MCTS max batch size:    144
MCTS avg batch size: 123.72
Total runtime:      47.844s
Avg runtime:         0.048s

Current default configuration on my machine averages a throughput of 1 self-play game (with 400 iters) every 0.048 seconds.

Compared to the python implementation (1 game ever 29.456s), this is more than 613x faster.