A clean implementation of MuZero and AlphaZero following the AlphaZero General framework. Train and pit both algorithms against each other, and investigate the reliability of learned MuZero MDP models.
MIT License
Coach.py:

def learn(self) -> None:
    for i in range(1, self.args.num_selfplay_iterations + 1):
        print(f'------ITER {i}------')
        if not self.update_on_checkpoint or i > 1:  # else: go directly to backpropagation
            # Self-play / gather training data.
            iteration_train_examples = list()
            scores = list()
            for _ in trange(self.args.num_episodes, desc="Self Play", file=sys.stdout):
                self.mcts.clear_tree()
                game_history, score = self.executeEpisode()
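For context, the data-gathering loop above can be sketched as a standalone function. This is a minimal sketch, not the repository's actual API: `execute_episode` and `gather_self_play_data` are hypothetical stand-ins for `Coach.executeEpisode` and the inner self-play loop, with a dummy environment in place of a real game.

```python
import random

def execute_episode(max_steps=10):
    """Hypothetical stand-in for Coach.executeEpisode: plays one
    self-play episode and returns (trajectory, total reward)."""
    history, score = [], 0.0
    for _ in range(max_steps):
        observation, reward = random.random(), 1.0  # dummy environment step
        history.append(observation)
        score += reward
    return history, score

def gather_self_play_data(num_episodes):
    """Minimal analogue of the inner loop in Coach.learn: collect one
    trajectory and one episode score per self-play game."""
    iteration_train_examples, scores = [], []
    for _ in range(num_episodes):
        # In the real Coach, the MCTS search tree is cleared here
        # (self.mcts.clear_tree()) so statistics don't leak across games.
        game_history, score = execute_episode()
        iteration_train_examples.append(game_history)
        scores.append(score)
    return iteration_train_examples, scores
```

Each iteration of `Coach.learn` then trains on the accumulated `iteration_train_examples`, while `scores` is used to monitor self-play performance.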