opendilab / LightZero

[NeurIPS 2023 Spotlight] LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios (awesome MCTS)
https://huggingface.co/spaces/OpenDILabCommunity/ZeroPal
Apache License 2.0

Unexpected performance drop after resuming UniZero training #251

Closed: Tiikara closed this issue 2 months ago

Tiikara commented 3 months ago

I've been experimenting with UniZero training using the default configuration (atari_unizero_config.py). The initial training process progressed well, aligning with the expectations set in the paper. However, I've encountered an issue when attempting to resume training from a checkpoint.

My initial training run stopped at 0.3M env_steps, with a reward_mean of approximately 20. To continue training, I specified the last checkpoint in the model_path (trying both iteration_80000.pth.tar and ckpt_best.pth.tar). The console output confirmed that the model was successfully loaded.
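For context, this is roughly how the checkpoint was specified. A minimal sketch of the config change, assuming the `policy.model_path` field from the default atari_unizero_config.py; the file path is a placeholder for my local checkpoint directory:

```python
# Hypothetical excerpt from atari_unizero_config.py: point the policy at the
# previously saved checkpoint before launching training again. The attribute
# path and file location are assumptions; adjust to your local setup.
main_config.policy.model_path = './data_unizero/ckpt/iteration_80000.pth.tar'
# or resume from the best evaluation checkpoint instead:
# main_config.policy.model_path = './data_unizero/ckpt/ckpt_best.pth.tar'
```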

However, after resuming training, I observed an unexpected drop in performance. The reward_mean plummeted to around -13, and after about 13,000 env_steps, it stabilized near zero. This significant performance degradation is concerning and inconsistent with the previous training results.

I'm uncertain whether this issue stems from a problem in the model saving/loading algorithm or if I'm misunderstanding some aspect of the process. Could you please provide some insight into what might be causing this behavior and how to correctly resume training without losing performance? Thank you for your assistance.

[attached image: reward_mean curve showing the drop after resuming from the checkpoint]

puyuan1996 commented 3 months ago

Hello. In theory, if the model is loaded correctly, the reward_mean should stay around 20 in the first evaluation after loading; it may then vary during subsequent training because the optimizer's initial state and the data distribution in the buffer differ from the original run. However, your results show reward_mean dropping to about -13 in the first evaluation, which likely indicates that the model was not saved or loaded correctly. We will run tests to verify this within the week. In the meantime, please check the following:

- Ensure the environment settings and hyperparameters are identical to those used during the original training run.
- Print and compare key parameters (e.g., the policy/value heads) before saving and after loading to confirm they match exactly (see the sketch below).
- Save part of the replay buffer data and compare the policy/value distributions produced before saving and after loading to verify the outputs are consistent.
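As an illustration of the parameter comparison, here is a minimal sketch using only standard PyTorch. It is not code from the repository: the model is a stand-in, and the checkpoint layout (`{'model': state_dict}`) is an assumption that should be matched to whatever your training pipeline actually writes.

```python
import torch
import torch.nn as nn

def compare_state_dicts(sd_before, sd_after):
    """Return the names of parameters/buffers that differ between two state dicts."""
    mismatched = []
    for name, tensor in sd_before.items():
        if name not in sd_after or not torch.equal(tensor, sd_after[name]):
            mismatched.append(name)
    return mismatched

# Stand-in for the real UniZero model; replace with the model built by your
# training entry point so the check runs on the actual network.
model = nn.Linear(4, 2)

# 1. Keep a CPU copy of the weights before saving.
sd_before = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}

# 2. Save and reload. The checkpoint layout here is an assumption; match the
#    structure your pipeline actually writes.
torch.save({'model': model.state_dict()}, 'ckpt_debug.pth.tar')
ckpt = torch.load('ckpt_debug.pth.tar', map_location='cpu')
model.load_state_dict(ckpt['model'])

# 3. An empty list means the weights round-tripped exactly.
sd_after = {k: v.detach().cpu() for k, v in model.state_dict().items()}
print(compare_state_dicts(sd_before, sd_after))

# 4. For the policy/value check, run the same fixed observation batch through
#    the model before saving and after loading and confirm the outputs match.
```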

ruiheng123 commented 3 months ago

I hope these validation results are helpful to you.

Tiikara commented 3 months ago

@ruiheng123 Thank you for your explanation. I'd like to clarify a few points. Does the orange line represent training from scratch? It's intriguing that the network converged to a score of approximately 20 within 150k environment steps, which appears to differ from what was reported in the original UniZero paper. Could you elaborate on the methods or modifications you used to achieve such rapid convergence? Or is this run resumed from a checkpoint?