Hello. Regarding this issue: in theory, if the model is loaded correctly, reward_mean should stay around 20 during the first evaluation after loading. In subsequent training it may fluctuate, since the optimizer state is reinitialized and the data distribution in the replay buffer differs from the original run. However, your test results show reward_mean dropping sharply to -13 at the very first evaluation, which most likely means the model was not saved or loaded correctly. We will run tests to verify this within the week. In the meantime, please make sure the environment settings and hyperparameters are identical to those used when the model was trained, and print and compare key parameters (such as the policy/value heads) before saving and after loading to confirm they match exactly. For further verification, you can also save part of the replay buffer data and check whether the policy/value distributions produced before and after saving are consistent.
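In case it helps, here is a minimal sketch of the kind of comparison I mean, assuming a PyTorch model and a checkpoint saved with torch.save. The 'model' key in the checkpoint, the initial_inference entry point, and the output field names are illustrative and may need to be adapted to your actual network:

```python
import torch

def compare_state_dicts(model, ckpt_path):
    """List parameters that differ between a live model and a saved checkpoint."""
    ckpt = torch.load(ckpt_path, map_location='cpu')
    # Learner checkpoints often wrap the weights under a 'model' key;
    # otherwise treat the file as a bare state_dict.
    saved_sd = ckpt.get('model', ckpt) if isinstance(ckpt, dict) else ckpt
    live_sd = model.state_dict()
    mismatches = []
    for name, live_param in live_sd.items():
        if name not in saved_sd:
            mismatches.append((name, 'missing in checkpoint'))
        elif not torch.equal(live_param.cpu(), saved_sd[name].cpu()):
            mismatches.append((name, 'values differ'))
    mismatches += [(name, 'missing in live model') for name in saved_sd if name not in live_sd]
    return mismatches

def compare_head_outputs(model_before, model_after, obs_batch):
    """Feed the same observations to both models and compare policy/value outputs."""
    model_before.eval()
    model_after.eval()
    with torch.no_grad():
        out_a = model_before.initial_inference(obs_batch)
        out_b = model_after.initial_inference(obs_batch)
    policy_match = torch.allclose(out_a.policy_logits, out_b.policy_logits, atol=1e-6)
    value_match = torch.allclose(out_a.value, out_b.value, atol=1e-6)
    return policy_match, value_match
```

If compare_state_dicts returns an empty list and both comparisons in compare_head_outputs are True on a fixed batch of saved observations, the save/load path itself is very likely fine and the problem lies elsewhere (e.g., config or buffer state).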
I hope these validation results are helpful to you.
@ruiheng123 Thank you for your explanation. I'd like to clarify a few points. Does the orange line represent training from scratch? It's intriguing that the network converged to a score of approximately 20 within 150k environment steps; this appears to differ from the results reported in the original UniZero paper. Could you elaborate on the methods or modifications you used to achieve such rapid convergence, or is this curve simply resuming from a checkpoint?
I've been experimenting with UniZero training using the default configuration (atari_unizero_config.py). The initial training process progressed well, aligning with the expectations set in the paper. However, I've encountered an issue when attempting to resume training from a checkpoint.
My initial training run stopped at 0.3M env_steps, with a reward_mean of approximately 20. To continue training, I specified the last checkpoint in the model_path (trying both iteration_80000.pth.tar and ckpt_best.pth.tar). The console output confirmed that the model was successfully loaded.
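For reference, this is roughly how I pointed the config at the checkpoint. The excerpt below is only a sketch of the relevant field, with the experiment path abbreviated; everything else was left at the defaults from atari_unizero_config.py:

```python
# Excerpt from my atari_unizero_config.py (path abbreviated).
atari_unizero_config = dict(
    # ...
    policy=dict(
        # ...
        # Path to the checkpoint from the previous run
        # (I also tried ckpt_best.pth.tar from the same ckpt directory).
        model_path='./data_unizero/<exp_name>/ckpt/iteration_80000.pth.tar',
    ),
)
```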
However, after resuming training, I observed an unexpected drop in performance. The reward_mean plummeted to around -13, and after about 13,000 env_steps, it stabilized near zero. This significant performance degradation is concerning and inconsistent with the previous training results.
I'm uncertain whether this issue stems from a problem in the model saving/loading algorithm or if I'm misunderstanding some aspect of the process. Could you please provide some insight into what might be causing this behavior and how to correctly resume training without losing performance? Thank you for your assistance.