opendilab / LightZero

[NeurIPS 2023 Spotlight] LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios (awesome MCTS)
https://huggingface.co/spaces/OpenDILabCommunity/ZeroPal
Apache License 2.0

Unexpected performance drop after resuming UniZero training #251

Closed: Tiikara closed this issue 2 months ago

Tiikara commented 3 months ago

I've been experimenting with UniZero training using the default configuration (atari_unizero_config.py). The initial training process progressed well, aligning with the expectations set in the paper. However, I've encountered an issue when attempting to resume training from a checkpoint.

My initial training run stopped at 0.3M env_steps, with a reward_mean of approximately 20. To continue training, I specified the last checkpoint in the model_path (trying both iteration_80000.pth.tar and ckpt_best.pth.tar). The console output confirmed that the model was successfully loaded.
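For context, this is roughly how the checkpoint was specified. A minimal sketch of the config change, assuming the `policy.model_path` field from the default atari_unizero_config.py; the file path is a placeholder for my local checkpoint directory:

```python
# Hypothetical excerpt from atari_unizero_config.py: point the policy at the
# previously saved checkpoint before launching training again. The attribute
# path and file location are assumptions; adjust to your local setup.
main_config.policy.model_path = './data_unizero/ckpt/iteration_80000.pth.tar'
# or resume from the best evaluation checkpoint instead:
# main_config.policy.model_path = './data_unizero/ckpt/ckpt_best.pth.tar'
```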

However, after resuming training, I observed an unexpected drop in performance. The reward_mean plummeted to around -13, and after about 13,000 env_steps, it stabilized near zero. This significant performance degradation is concerning and inconsistent with the previous training results.

I'm uncertain whether this issue stems from a problem in the model saving/loading algorithm or if I'm misunderstanding some aspect of the process. Could you please provide some insight into what might be causing this behavior and how to correctly resume training without losing performance? Thank you for your assistance.

[attached image: reward_mean curve showing the drop after resuming from the checkpoint]

puyuan1996 commented 3 months ago

Hello. In theory, if the model is loaded correctly, the reward_mean should stay around 20 in the first evaluation after loading; it may then vary during subsequent training because the optimizer's initial state and the data distribution in the buffer differ from the original run. However, your results show reward_mean dropping to about -13 in the first evaluation, which likely indicates that the model was not saved or loaded correctly. We will run tests to verify this within the week. In the meantime, please check the following:

- Ensure the environment settings and hyperparameters are identical to those used during the original training run.
- Print and compare key parameters (e.g., the policy/value heads) before saving and after loading to confirm they match exactly (see the sketch below).
- Save part of the replay buffer data and compare the policy/value distributions produced before saving and after loading to verify the outputs are consistent.
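As an illustration of the parameter comparison, here is a minimal sketch using only standard PyTorch. It is not code from the repository: the model is a stand-in, and the checkpoint layout (`{'model': state_dict}`) is an assumption that should be matched to whatever your training pipeline actually writes.

```python
import torch
import torch.nn as nn

def compare_state_dicts(sd_before, sd_after):
    """Return the names of parameters/buffers that differ between two state dicts."""
    mismatched = []
    for name, tensor in sd_before.items():
        if name not in sd_after or not torch.equal(tensor, sd_after[name]):
            mismatched.append(name)
    return mismatched

# Stand-in for the real UniZero model; replace with the model built by your
# training entry point so the check runs on the actual network.
model = nn.Linear(4, 2)

# 1. Keep a CPU copy of the weights before saving.
sd_before = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}

# 2. Save and reload. The checkpoint layout here is an assumption; match the
#    structure your pipeline actually writes.
torch.save({'model': model.state_dict()}, 'ckpt_debug.pth.tar')
ckpt = torch.load('ckpt_debug.pth.tar', map_location='cpu')
model.load_state_dict(ckpt['model'])

# 3. An empty list means the weights round-tripped exactly.
sd_after = {k: v.detach().cpu() for k, v in model.state_dict().items()}
print(compare_state_dicts(sd_before, sd_after))

# 4. For the policy/value check, run the same fixed observation batch through
#    the model before saving and after loading and confirm the outputs match.
```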

ruiheng123 commented 3 months ago

I hope these validation results are helpful to you.

Tiikara commented 3 months ago

@ruiheng123 Thank you for your explanation. I'd like to clarify a few points. Does the orange line represent training from scratch? It's intriguing that the network converged to a score of approximately 20 within 150k environment steps, which appears to differ from what was reported in the original UniZero paper. Could you elaborate on the methods or modifications you used to achieve such rapid convergence? Or is this run resumed from a checkpoint?