MiniZero: An AlphaZero and MuZero Training Framework

MiniZero

MiniZero is a zero-knowledge learning framework that supports AlphaZero, MuZero, Gumbel AlphaZero, and Gumbel MuZero algorithms.

This is the official repository of the IEEE ToG paper MiniZero: Comparative Analysis of AlphaZero and MuZero on Go, Othello, and Atari Games.

If you use MiniZero for research, please consider citing our paper as follows:

@article{wu2024minizero,
  title={MiniZero: Comparative Analysis of AlphaZero and MuZero on Go, Othello, and Atari Games},
  author={Wu, Ti-Rong and Guei, Hung and Peng, Pei-Chiun and Huang, Po-Wei and Wei, Ting Han and Shih, Chung-Chin and Tsai, Yun-Jui},
  journal={IEEE Transactions on Games},
  year={2024},
  publisher={IEEE}
}

Overview

MiniZero utilizes zero-knowledge learning algorithms to train game-specific AI models.

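It includes a variety of zero-knowledge learning algorithms: AlphaZero, MuZero, Gumbel AlphaZero, and Gumbel MuZero.

It supports a variety of game environments, including Go, NoGo, Killall-Go, Gomoku, Outer-Open Gomoku, Othello, Hex, TicTacToe, and Atari games.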

We are planning to add new algorithms, features, and more games in the future.

Architecture

The MiniZero architecture comprises four components: a server, self-play workers, an optimization worker, and data storage.

[Figure: MiniZero architecture]

Server: The server is the core component of MiniZero, controlling the training process and managing both the self-play and optimization workers. In each iteration, the server first instructs all self-play workers to generate self-play games simultaneously using the latest network, and collects game records from them. Once the server has accumulated enough self-play games, it stops the self-play workers and instructs the optimization worker to load the latest game records and start network updates. After the network has been updated, the server starts the next iteration, repeating until training reaches a predetermined maximum number of iterations.

Self-play worker: The self-play worker interacts with the environment to produce self-play games. There may be multiple self-play workers. Each self-play worker maintains multiple MCTS instances to play multiple games simultaneously, using batch GPU inference to improve efficiency. Specifically, the self-play worker runs the selection phase for each MCTS instance to collect a batch of leaf nodes and then evaluates them in a single batched GPU inference. Finished self-play games are sent to the server, which forwards them to the data storage.

Optimization worker: The optimization worker updates the network using collected self-play games. Specifically, it loads self-play games from the data storage, stores them in the replay buffer, and then updates the network over a number of steps using data sampled from the replay buffer. Generally, the number of optimization steps is proportional to the number of collected self-play games to prevent overfitting. Finally, the updated networks are stored in the data storage.

Data storage: The data storage stores network files and self-play games. Specifically, it uses the Network File System (NFS) to share data across different machines. This is an implementation choice; a simpler file system can suffice if distributed computing is not employed.
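The iteration cycle described above can be pictured as a simple control loop. The following is only an illustrative shell sketch, not MiniZero's actual server code: the function names `run_self_play`, `run_optimization`, and `train`, as well as the games-per-iteration and steps-per-game values, are hypothetical placeholders chosen for the example.

```shell
# Illustrative sketch only; every name and number here is a hypothetical placeholder.
GAMES_PER_ITERATION=4   # hypothetical number of self-play games per iteration
STEPS_PER_GAME=2        # hypothetical optimization steps per collected game

run_self_play() {       # stands in for the self-play workers
    echo "iteration $1: collect $GAMES_PER_ITERATION self-play games"
}

run_optimization() {    # stands in for the optimization worker
    # The step count scales with the number of collected games.
    steps=$((GAMES_PER_ITERATION * STEPS_PER_GAME))
    echo "iteration $1: update network for $steps steps"
}

train() {               # stands in for the server's control loop
    max_iteration=$1
    i=1
    while [ "$i" -le "$max_iteration" ]; do
        run_self_play "$i"      # generate games with the latest network
        run_optimization "$i"   # then update the network on those games
        i=$((i + 1))
    done
}

train 2
```

The point of the sketch is the ordering: self-play and optimization alternate within each iteration rather than running concurrently, and the amount of optimization is tied to the amount of newly collected data.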

Results

The performance of each zero-knowledge learning algorithm on board games and Atari games is shown below, where α0, μ0, g-α0, and g-μ0 represent AlphaZero, MuZero, Gumbel AlphaZero, and Gumbel MuZero, respectively, and $n$ represents the simulation count. More details and publicly released AI models are available here.

Results on board games:

[Figures: Go 9x9, Othello 8x8]

Results on Atari games:

[Figure: Atari]

Prerequisites

MiniZero requires a Linux platform with at least one NVIDIA GPU to operate. To facilitate the use of MiniZero, a container image is pre-built to include all required packages. Thus, a container tool such as docker or podman is also required.

Platform recommendations:

* Modern CPU with at least 64G RAM
* NVIDIA GPU of GTX 1080 (VRAM 8G) or above
* Linux operating system, e.g., Ubuntu 22.04 LTS

Tested platforms:

|CPU|RAM|GPU|OS|
|---|---|---|---|
|Xeon Silver 4216 x2|256G|RTX A5000 x4|Ubuntu 20.04.6 LTS|
|Xeon Silver 4216 x2|128G|RTX 3080 Ti x4|Ubuntu 20.04.5 LTS|
|Xeon Silver 4216 x2|256G|RTX 3090 x4|Ubuntu 20.04.5 LTS|
|Xeon Silver 4210 x2|128G|RTX 3080 x4|Ubuntu 22.04 LTS|
|Xeon E5-2678 v3 x2|192G|GTX 1080 Ti x4|Ubuntu 20.04.5 LTS|
|Xeon E5-2698 v4 x2|128G|GTX 1080 Ti x1|Arch Linux LTS (5.15.90)|
|Core i9-7980XE|128G|GTX 1080 Ti x1|Arch Linux (6.5.6)|

Quick Start

This section walks you through training AI models using zero-knowledge learning algorithms, evaluating trained AI models, and launching the console to interact with the AI.

First, clone this repository.

git clone git@github.com:rlglab/minizero.git
cd minizero # enter the cloned repository

Then, start the runtime environment using the container.

scripts/start-container.sh # must have either podman or docker installed
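If the script fails to start, it can help to confirm that a container tool is actually on the `PATH`. The snippet below is a small sketch for that check; the helper name `find_container_tool` is hypothetical, and the order in which it tries podman and docker is an assumption that does not necessarily match what `scripts/start-container.sh` does internally.

```shell
# Report the first container tool found on the PATH.
# The podman-before-docker order is an assumption made for this sketch.
find_container_tool() {
    for tool in podman docker; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    return 1
}

if found=$(find_container_tool); then
    echo "Found container tool: $found"
else
    echo "Neither podman nor docker was found; install one first." >&2
fi
```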

Once a container starts successfully, its working folder should be located at /workspace. You must execute all of the following commands inside the container.

Training

To train 9x9 Go:

# AlphaZero with 200 simulations
tools/quick-run.sh train go az 300 -n go_9x9_az_n200 -conf_str env_board_size=9:actor_num_simulation=200

# Gumbel AlphaZero with 16 simulations
tools/quick-run.sh train go gaz 300 -n go_9x9_gaz_n16 -conf_str env_board_size=9:actor_num_simulation=16

To train Ms. Pac-Man:

# MuZero with 50 simulations
tools/quick-run.sh train atari mz 300 -n ms_pacman_mz_n50 -conf_str env_atari_name=ms_pacman:actor_num_simulation=50

# Gumbel MuZero with 18 simulations
tools/quick-run.sh train atari gmz 300 -n ms_pacman_gmz_n18 -conf_str env_atari_name=ms_pacman:actor_num_simulation=18
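In the commands above, the value passed to `-conf_str` is a list of `key=value` overrides joined by colons. When several overrides are involved, assembling that string with a small helper can reduce typos. The sketch below assumes nothing beyond that colon-separated format; the helper name `join_conf` is hypothetical and is not part of MiniZero.

```shell
# Join key=value overrides with ':' to form a -conf_str argument.
# join_conf is a hypothetical helper written for this example.
join_conf() {
    result=""
    for pair in "$@"; do
        if [ -z "$result" ]; then
            result="$pair"
        else
            result="$result:$pair"
        fi
    done
    echo "$result"
}

conf=$(join_conf env_board_size=9 actor_num_simulation=200)

# Print the command instead of running it, so it can be reviewed first.
echo tools/quick-run.sh train go az 300 -n go_9x9_az_n200 -conf_str "$conf"
```

Removing the `echo` on the last line would run the same AlphaZero training command shown earlier.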

For more training details, please refer to these instructions.

Evaluation

To evaluate the strength growth during training:

# the strength growth for "go_9x9_az_n200"
tools/quick-run.sh self-eval go go_9x9_az_n200 -conf_str env_board_size=9:actor_num_simulation=800:actor_select_action_by_count=true:actor_select_action_by_softmax_count=false:actor_use_dirichlet_noise=false:actor_use_gumbel_noise=false

To compare the strengths between two trained AI models:

# the relative strengths between "go_9x9_az_n200" and "go_9x9_gaz_n16"
tools/quick-run.sh fight-eval go go_9x9_az_n200 go_9x9_gaz_n16 -conf_str env_board_size=9:actor_num_simulation=800:actor_select_action_by_count=true:actor_select_action_by_softmax_count=false:actor_use_dirichlet_noise=false:actor_use_gumbel_noise=false
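Both evaluation commands share the same long list of overrides. Judging only by the option names, these appear to switch action selection to visit counts and turn off exploration noise during evaluation; that reading is an inference, not something stated here. To avoid retyping the list, it can be kept in a shell variable, as in the sketch below. The variable name `EVAL_CONF` is hypothetical.

```shell
# Evaluation-time overrides copied from the commands above.
# EVAL_CONF is a hypothetical variable name used only in this sketch.
EVAL_CONF="env_board_size=9:actor_num_simulation=800"
EVAL_CONF="$EVAL_CONF:actor_select_action_by_count=true"
EVAL_CONF="$EVAL_CONF:actor_select_action_by_softmax_count=false"
EVAL_CONF="$EVAL_CONF:actor_use_dirichlet_noise=false"
EVAL_CONF="$EVAL_CONF:actor_use_gumbel_noise=false"

# Print both commands for review; drop the leading 'echo' to run them.
echo tools/quick-run.sh self-eval go go_9x9_az_n200 -conf_str "$EVAL_CONF"
echo tools/quick-run.sh fight-eval go go_9x9_az_n200 go_9x9_gaz_n16 -conf_str "$EVAL_CONF"
```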

Note that for Atari games, evaluation results are generated during training. Check ms_pacman_mz_n50/analysis/*_Return.png for the results.

For more evaluation details, please refer to these instructions.

Console

To interact with a trained model using the Go Text Protocol (GTP):

# play with the "go_9x9_az_n200" model
tools/quick-run.sh console go go_9x9_az_n200 -conf_str env_board_size=9:actor_num_simulation=800:actor_select_action_by_count=true:actor_select_action_by_softmax_count=false:actor_use_dirichlet_noise=false:actor_use_gumbel_noise=false
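A GTP session is a plain-text exchange in which each line is one command. The sketch below prints a short sequence of standard GTP commands of the kind one might type into the console. The commands are taken from the general GTP specification; which subset the MiniZero console implements is not stated here, so treat the sequence as an assumption. The function name `gtp_commands` is hypothetical.

```shell
# Print a short sequence of standard GTP commands, one per line.
# Whether the MiniZero console accepts each of them is an assumption.
gtp_commands() {
    printf '%s\n' \
        'boardsize 9' \
        'clear_board' \
        'play b e5' \
        'genmove w' \
        'quit'
}

gtp_commands
```

Here `play b e5` places a black stone at E5 and `genmove w` asks the engine to generate a move for white.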

For more console details, please refer to these instructions.

Development

We are actively adding new algorithms, features, and games into MiniZero.

The following work-in-progress features will be available in future versions:

We welcome developers to join the MiniZero community. For more development tips, please refer to these instructions.

References