Hyperparameter of Muzero and reproducibility of the results

marintoro commented 3 months ago

Hello,

I am trying to reproduce the results of MuZero on Atari (I am using the MsPacmanNoFrameskip-v4 env, as it is the one with the most published results in the original MuZero paper).

I have 2 questions about this:

1) About the default hyperparameters:

Batch_size: I see in the atari_muzero_config.py file that batch_size = 256, whereas the main paper states mini-batches of size 1024 in Atari.
Temperature: There is also a temperature decay in MuZero (mentioned on page 13), but it does not seem to be used in the default config.
Ratio nb_steps/backprop: In the default config, it looks like we collect 8 episodes and, no matter how long the episodes are (and thus how much data is added to the replay buffer), there are update_per_collect = 1000 backprops every time. This is odd, and it does not look like what is mentioned in the main paper either for standard MuZero or for MuZero ReAnalyse: "2.0 samples were drawn per state, instead of 0.1".
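
To make the last point concrete, here is a back-of-the-envelope sketch of why a fixed update_per_collect yields a ratio that moves with episode length. It assumes "samples per state" means training samples drawn divided by env steps collected, which may not be exactly how the paper counts them. The helper name and the episode lengths are hypothetical; only the 8 episodes, 1000 updates and batch size 256 come from the default config discussed above.

```python
def samples_per_state(update_per_collect: int, batch_size: int,
                      collected_env_steps: int) -> float:
    """Training samples drawn per collected env step (assumed definition)."""
    if collected_env_steps <= 0:
        raise ValueError("collected_env_steps must be positive")
    return update_per_collect * batch_size / collected_env_steps


# Values from the default config discussed above.
UPDATE_PER_COLLECT = 1000
BATCH_SIZE = 256
EPISODES_PER_COLLECT = 8

# Hypothetical episode lengths, only to show how the ratio moves.
for episode_len in (500, 1000, 4000):
    env_steps = EPISODES_PER_COLLECT * episode_len
    ratio = samples_per_state(UPDATE_PER_COLLECT, BATCH_SIZE, env_steps)
    print(f"{episode_len} steps/episode -> {ratio:.1f} samples per state")
```

Under these assumptions the ratio shrinks as episodes get longer, so it is never pinned to a single value such as 0.1 or 2.0.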

My main question on this topic is: how were MuZero's hyperparameters chosen, and do they match the hyperparameters used in the main paper? And if some don't match or are not known, is there a list of the hyperparameters that differ from, or are missing from, the original paper?

2) About the performance:

In the Readme.md there are some results on common benchmarks and tasks such as MsPacmanNoFrameskip-v4. The problem is that performance is reported on a really tiny fraction of the steps reported in the main paper: 200M (or even 20 billion) vs 0.4M env steps. The final results are thus really not comparable, e.g. on Pacman, MuZero reaches scores around 230 000 against 2 500 in your small experiment on 0.4M steps...

My main question on this topic is: did you try to run some experiments with a number of steps comparable to the original MuZero (i.e. at least 200M env steps), and in those experiments, are the results you obtain comparable with those from the original MuZero paper?

puyuan1996 commented 3 months ago

Hello, thank you for your question.

Regarding the default hyperparameters:
- It should be noted that we did not fine-tune our parameters; most hyperparameters are the same as those in the original MuZero paper, with differences primarily based on computational cost considerations. For specific hyperparameters, you can refer to the Appendix Table 7 in our paper or the default configuration for the corresponding algorithm.
- Specifically, the batch size is set to 256 mainly to accommodate machines with less CUDA memory. Temperature decay is turned off by default because, under our training settings (500k Env Steps), it doesn't have much impact. Regarding update_per_collect, we provide the model_update_ratio parameter (similar to the replay ratio in the literature) to automatically determine the number of training steps based on the collected env steps. In our experiments, we did use a model update ratio of 0.25 for training, and we can update the default configuration to reflect this in the future.
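
A minimal sketch of how a ratio-based setting could turn collected env steps into a number of training steps. It assumes model_update_ratio means gradient updates per collected env step, which is one reading of the description above. The function name and the step counts are hypothetical and are not taken from the LightZero code.

```python
def updates_for_collect(collected_env_steps: int, model_update_ratio: float) -> int:
    """Number of gradient updates to run after one collection phase.

    Assumption: updates = collected env steps * model_update_ratio,
    rounded down, with at least one update per collection.
    """
    if collected_env_steps < 0 or model_update_ratio <= 0:
        raise ValueError("need non-negative steps and a positive ratio")
    return max(1, int(collected_env_steps * model_update_ratio))


MODEL_UPDATE_RATIO = 0.25  # value quoted in the reply above

# Hypothetical amounts of data gathered in one collection phase.
for env_steps in (2_000, 8_000, 32_000):
    n_updates = updates_for_collect(env_steps, MODEL_UPDATE_RATIO)
    print(f"{env_steps} env steps -> {n_updates} updates")
```

With this reading, the amount of training scales with the amount of new data, unlike a fixed update_per_collect = 1000.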
On the issue of training steps:
- We use relatively small training steps (Env Steps) based on computational cost considerations, focusing primarily on sample efficiency as described in papers like EfficientZero. Therefore, we did not conduct experiments with 200M steps. However, we can indeed conduct experiments with up to 200M steps on MsPacman in future work to test the effectiveness.

Best wishes!

marintoro commented 3 months ago

Thanks for the really fast answer! The Table 7 from your paper with all the hyperparameters is exactly what I was looking for!

Indeed, it would be convenient if the default configuration for each algorithm (e.g. atari_muzero_config.py) matched the one you actually used in your experiments.

opendilab / LightZero

Hyperparameter of Muzero and reproducibility of the results #229