opendilab / LightZero

[NeurIPS 2023 Spotlight] LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios
https://huggingface.co/spaces/OpenDILabCommunity/ZeroPal
Apache License 2.0

Bad performance on long run on MsPacman and SpaceInvaders #233

Open marintoro opened 2 weeks ago

marintoro commented 2 weeks ago

Hello,

This issue is closely related to #229: I am trying to reproduce the results of the original MuZero paper on Atari (or at least on a small subset of games; for the moment I have tried MsPacman and SpaceInvaders).

I started from the latest commit of main (6090ab1) and made only the small changes to atari_muzero_config.py listed below:
- update_per_collect = 1000 → model_update_ratio = 0.25
- max_env_step = int(1e6) → max_env_step = int(1e8)
- optim_type='SGD' / learning_rate=0.2 → optim_type='Adam' / learning_rate=0.003 / lr_piecewise_constant_decay=False
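For clarity, here is a minimal sketch of those overrides in Python. The key names come from the list above; the dict nesting is an assumption and may not match the actual layout of atari_muzero_config.py.

```python
# Hedged sketch of the overrides above; the nesting is assumed, only the key
# names and values are taken from the list in this comment.
atari_muzero_config = dict(
    policy=dict(
        # assumption: drop the fixed update_per_collect=1000 so that the
        # ratio-based schedule below is used instead
        update_per_collect=None,
        model_update_ratio=0.25,
        # switch from SGD with a large LR to Adam with a small constant LR
        optim_type='Adam',
        learning_rate=0.003,
        lr_piecewise_constant_decay=False,
    ),
)
# train for a much larger budget than the default
max_env_step = int(1e8)  # was int(1e6)
```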

The results on both MsPacman and SpaceInvaders seem to be pretty low, with no more learning after 600k learning steps (i.e. 600k*(1/0.25) = 2.4M env steps, because the update ratio is 0.25, which is 10M env frames with a frame_skip of 4). I still have both experiments running, one now at around 1.4M learning steps and the other at 1.1M, but the results don't seem to improve anymore...

Do you know what could be the reason for such a failure? Did you use any lr_piecewise_constant_decay in your experiments? Or did you use eps_greedy_exploration_in_collect? Another big question is why so much data is collected before doing any backprop at all. In the current implementation you wait for the end of 8 collected episodes before starting backprop, which can be a ton of transitions since Atari episodes can be really long. I don't think this is good for stable learning; we should start doing backprop after a fixed number of collected transitions instead (a quick and dirty improvement could be to collect ONLY ONE episode and do some backprop after it), as in the sketch below.
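To make the suggestion concrete, here is a hypothetical training loop. This is not LightZero's actual collector/learner API: collector, replay_buffer and learner are placeholder objects, and the chunk size is an arbitrary example value.

```python
# Hypothetical sketch, not LightZero code: `collector`, `replay_buffer` and
# `learner` are placeholders standing in for the real components.
def train_by_transitions(collector, replay_buffer, learner,
                         chunk_size=400, model_update_ratio=0.25,
                         max_env_steps=int(1e8)):
    """Trigger gradient updates every `chunk_size` collected transitions,
    instead of waiting for a fixed number of complete episodes."""
    env_steps = 0
    while env_steps < max_env_steps:
        # collect a fixed-size chunk of transitions (placeholder call)
        transitions = collector.collect(n_transitions=chunk_size)
        replay_buffer.extend(transitions)
        env_steps += len(transitions)
        # number of gradient steps proportional to the data just collected
        num_updates = max(1, int(len(transitions) * model_update_ratio))
        for _ in range(num_updates):
            batch = replay_buffer.sample()
            learner.train_step(batch)  # placeholder call
```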

Below are the curves I obtained after around 1.4M learning steps, equivalent to 5.6M env steps (i.e. around 22M frames, given the frame_skip of 4).

For MsPacmanNoFrameskip-v4: [reward curve image: curve_MsPacman]

For SpaceInvadersNoFrameskip-v4: [reward curve image: space_invaders_curve]

marintoro commented 2 weeks ago

Hello,

Just to give more information: both of my trainings are now at around 1.8M learning steps and 7.3M env steps (i.e. close to 30M Atari frames), and the performance has still not made any improvement. I will stop them, because it is highly unlikely that they would improve after more than 1M learning steps without any progress.

To sum up, performance on both MsPacman and SpaceInvaders seems to be really low and stops improving after only 600k learning steps. Note that I only ran one seed, but this is probably enough to say that there is an issue and that the current code probably can't reproduce the results of the original MuZero.

puyuan1996 commented 1 day ago

Hello, all our current experiments (on MsPacman) are being conducted with use_priority=False, which means we are using uniform sampling. The experimental results are as follows:

Experimental Observations:

Future Work:

marintoro commented 1 day ago

Hello, thanks for these detailed experiments.

I think it is still way too early to know if your experiments will continue to improve or if they will stagnate like the one I launched. I believe a 'good' score in Pacman would be around 15k/20k, which is still far from the 250k score mentioned in the original MuZero paper (although they used 20 billion frames, which is insane...). However, consistently completing the first level of the game seems like a good target to determine if the current implementation is 'working' as expected.

Finally, I have one last question: why do you set use_priority = False? Are you sure the current implementation is not functional? I quickly checked the implementation and could not find any bugs (but admittedly I didn't go into deep detail).

puyuan1996 commented 1 day ago

Consistent with your analysis, we will indeed keep an eye on the subsequent long-term experimental results to confirm the robustness of the performance, and will update the relevant info here.

As for the reason for using use_priority = False, it is as follows:

Currently, during data collection, the priorities we store are calculated based on the L1 error between the search value and the predicted value (here). However, when updating priorities during training (here), we are using the L1 error between the n-step bootstrapped value and the predicted value. The appendix of the original MuZero paper points out that priorities should be calculated based on the L1 error between the search value and the n-step bootstrapped value.

To maintain consistency, we might need to use the L1 error between the latest reanalyzed search value and the n-step bootstrapped value when updating priorities during training. However, these implementation details still require further verification through subsequent experiments and analysis.
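To make the three variants concrete, here is a small sketch (a pure illustration of the definitions above, not the actual LightZero code paths linked in this comment):

```python
import numpy as np

def priority_at_collection(search_value, predicted_value):
    """Priority stored at collection time in the current code:
    L1 error between the MCTS search value and the predicted value."""
    return np.abs(search_value - predicted_value)

def priority_at_training_update(nstep_value, predicted_value):
    """Priority used when updating priorities during training in the current
    code: L1 error between the n-step bootstrapped value and the predicted value."""
    return np.abs(nstep_value - predicted_value)

def priority_muzero_appendix(search_value, nstep_value):
    """Priority as described in the MuZero appendix:
    L1 error between the search value and the n-step bootstrapped value."""
    return np.abs(search_value - nstep_value)
```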

marintoro commented 16 hours ago

It's true that in the original paper they say that priorities are based on the L1 error between the search value and the n-step bootstrapped value, but the current implementation seems at least ok-ish to me, as it uses the usual priorities, i.e. the TD-error between the n-step bootstrapped value and the predicted value, as in the original PER paper.

By the way, I wonder if we should use the L1 error with the true values or the L1 error with the scaled values. It seems more intuitive to me to use the TD-error with the scaled value as it's the one we actually use to compute the loss for backpropagation. Moreover, if we don't use reward clipping, the TD-error with the true values can be really high (in some games, the expected value can be greater than 1 million) and I think this could impact the stability of the training.
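For illustration, assuming the usual invertible value-scaling transform from the MuZero appendix, h(x) = sign(x)(sqrt(|x| + 1) - 1) + eps*x with eps = 0.001, the two options would look roughly like this. This is a sketch of the idea only, not of how LightZero actually computes its value targets:

```python
import numpy as np

def h(x, eps=1e-3):
    """Invertible value-scaling transform used in the MuZero appendix."""
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + eps * x

def priority_raw(target_value, predicted_value):
    """L1 error in the raw return space; without reward clipping this can
    become huge (returns above 1e6 in some games)."""
    return np.abs(target_value - predicted_value)

def priority_scaled(target_value, predicted_value):
    """L1 error in the scaled space, i.e. the space the value loss is
    actually computed in."""
    return np.abs(h(target_value) - h(predicted_value))
```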