opendilab / LightZero

[NeurIPS 2023 Spotlight] LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios
https://huggingface.co/spaces/OpenDILabCommunity/ZeroPal
Apache License 2.0

Bad performance on long run on MsPacman and SpaceInvaders #233

Open marintoro opened 2 weeks ago

marintoro commented 2 weeks ago

Hello,

This issue is closely related to #229: I am trying to reproduce the results of the original MuZero paper on Atari (or at least on a small subset of games; for the moment I have tried MsPacman and SpaceInvaders).

I started from the latest commit of main (6090ab1) and made only the small changes to atari_muzero_config.py listed below:
- update_per_collect = 1000 → model_update_ratio = 0.25
- max_env_step = int(1e6) → max_env_step = int(1e8)
- optim_type='SGD' / learning_rate=0.2 → optim_type='Adam' / learning_rate=0.003 / lr_piecewise_constant_decay=False
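For clarity, here is a minimal sketch of those overrides in Python. The key names come from the list above; the dict nesting is an assumption and may not match the actual layout of atari_muzero_config.py.

```python
# Hedged sketch of the overrides above; the nesting is assumed, only the key
# names and values are taken from the list in this comment.
atari_muzero_config = dict(
    policy=dict(
        # assumption: drop the fixed update_per_collect=1000 so that the
        # ratio-based schedule below is used instead
        update_per_collect=None,
        model_update_ratio=0.25,
        # switch from SGD with a large LR to Adam with a small constant LR
        optim_type='Adam',
        learning_rate=0.003,
        lr_piecewise_constant_decay=False,
    ),
)
# train for a much larger budget than the default
max_env_step = int(1e8)  # was int(1e6)
```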

The results on both MsPacman and SpaceInvaders seem to be pretty low, with no more learning after 600k learning steps (i.e. 600k*(1/0.25) = 2.4M env steps, because the update ratio is 0.25, which is 10M env frames with a frame_skip of 4). I still have both experiments running, one now at around 1.4M learning steps and the other at 1.1M, but the results don't seem to improve anymore...

Do you know what could be the reason for such a failure? Did you use any lr_piecewise_constant_decay in your experiments? Or did you use eps_greedy_exploration_in_collect? Another big question is why so much data is collected before doing any backprop at all. In the current implementation you wait for the end of 8 collected episodes before starting backprop, which can be a ton of transitions since Atari episodes can be really long. I don't think this is good for stable learning; we should start doing backprop after a fixed number of collected transitions instead (a quick and dirty improvement could be to collect ONLY ONE episode and do some backprop after it), as in the sketch below.
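To make the suggestion concrete, here is a hypothetical training loop. This is not LightZero's actual collector/learner API: collector, replay_buffer and learner are placeholder objects, and the chunk size is an arbitrary example value.

```python
# Hypothetical sketch, not LightZero code: `collector`, `replay_buffer` and
# `learner` are placeholders standing in for the real components.
def train_by_transitions(collector, replay_buffer, learner,
                         chunk_size=400, model_update_ratio=0.25,
                         max_env_steps=int(1e8)):
    """Trigger gradient updates every `chunk_size` collected transitions,
    instead of waiting for a fixed number of complete episodes."""
    env_steps = 0
    while env_steps < max_env_steps:
        # collect a fixed-size chunk of transitions (placeholder call)
        transitions = collector.collect(n_transitions=chunk_size)
        replay_buffer.extend(transitions)
        env_steps += len(transitions)
        # number of gradient steps proportional to the data just collected
        num_updates = max(1, int(len(transitions) * model_update_ratio))
        for _ in range(num_updates):
            batch = replay_buffer.sample()
            learner.train_step(batch)  # placeholder call
```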

Below are the curves I obtained after around 1.4M learning steps, equivalent to 5.6M env steps (i.e. around 22M frames, given the frame_skip of 4).

For MsPacmanNoFrameskip-v4: [reward curve image: curve_MsPacman]

For SpaceInvadersNoFrameskip-v4: [reward curve image: space_invaders_curve]

marintoro commented 2 weeks ago

Hello,

Just to give more information: both of my trainings are now at around 1.8M learning steps and 7.3M env steps (i.e. close to 30M Atari frames), and the performance has still not made any improvement. I will stop them, because it is highly unlikely that they would improve after more than 1M learning steps without any progress.

To sum up, performance on both MsPacman and SpaceInvaders seems to be really low and stops improving after only 600k learning steps. Note that I only ran one seed, but this is probably enough to say that there is an issue and that the current code probably can't reproduce the results of the original MuZero.

puyuan1996 commented 1 day ago

Hello, all our current experiments (on MsPacman) are being conducted with use_priority=False, which means we are using uniform sampling. The experimental results are as follows:

Experimental Observations:

Future Work:

marintoro commented 1 day ago

Hello, thanks for these detailed experiments.

I think it is still way too early to know if your experiments will continue to improve or if they will stagnate like the one I launched. I believe a 'good' score in Pacman would be around 15k/20k, which is still far from the 250k score mentioned in the original MuZero paper (although they used 20 billion frames, which is insane...). However, consistently completing the first level of the game seems like a good target to determine if the current implementation is 'working' as expected.

Finally, I have one last question: why do you set use_priority = False? Are you sure the current implementation is not functional? I quickly checked the implementation and could not find any bugs (but admittedly I didn't go into deep detail).

puyuan1996 commented 1 day ago

Consistent with your analysis, we will indeed keep an eye on the subsequent long-term experimental results to confirm the robustness of the performance, and will update the relevant info here.

As for the reason for using use_priority = False, it is as follows:

Currently, during data collection, the priorities we store are calculated based on the L1 error between the search value and the predicted value (here). However, when updating priorities during training (here), we are using the L1 error between the n-step bootstrapped value and the predicted value. The appendix of the original MuZero paper points out that priorities should be calculated based on the L1 error between the search value and the n-step bootstrapped value.

To maintain consistency, we might need to use the L1 error between the latest reanalyzed search value and the n-step bootstrapped value when updating priorities during training. However, these implementation details still require further verification through subsequent experiments and analysis.
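To make the three variants concrete, here is a small sketch (a pure illustration of the definitions above, not the actual LightZero code paths linked in this comment):

```python
import numpy as np

def priority_at_collection(search_value, predicted_value):
    """Priority stored at collection time in the current code:
    L1 error between the MCTS search value and the predicted value."""
    return np.abs(search_value - predicted_value)

def priority_at_training_update(nstep_value, predicted_value):
    """Priority used when updating priorities during training in the current
    code: L1 error between the n-step bootstrapped value and the predicted value."""
    return np.abs(nstep_value - predicted_value)

def priority_muzero_appendix(search_value, nstep_value):
    """Priority as described in the MuZero appendix:
    L1 error between the search value and the n-step bootstrapped value."""
    return np.abs(search_value - nstep_value)
```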

marintoro commented 16 hours ago

It's true that in the original paper they say that priorities are based on the L1 error between the search value and the n-step bootstrapped value, but the current implementation seems at least ok-ish to me, as it uses the usual priorities, i.e. the TD-error between the n-step bootstrapped value and the predicted value, as in the original PER paper.

By the way, I wonder if we should use the L1 error with the true values or the L1 error with the scaled values. It seems more intuitive to me to use the TD-error with the scaled value as it's the one we actually use to compute the loss for backpropagation. Moreover, if we don't use reward clipping, the TD-error with the true values can be really high (in some games, the expected value can be greater than 1 million) and I think this could impact the stability of the training.
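For illustration, assuming the usual invertible value-scaling transform from the MuZero appendix, h(x) = sign(x)(sqrt(|x| + 1) - 1) + eps*x with eps = 0.001, the two options would look roughly like this. This is a sketch of the idea only, not of how LightZero actually computes its value targets:

```python
import numpy as np

def h(x, eps=1e-3):
    """Invertible value-scaling transform used in the MuZero appendix."""
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + eps * x

def priority_raw(target_value, predicted_value):
    """L1 error in the raw return space; without reward clipping this can
    become huge (returns above 1e6 in some games)."""
    return np.abs(target_value - predicted_value)

def priority_scaled(target_value, predicted_value):
    """L1 error in the scaled space, i.e. the space the value loss is
    actually computed in."""
    return np.abs(h(target_value) - h(predicted_value))
```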