Hello, how do I resume training? Running the train command again loads the model, but the learning rate starts from the beginning.
That's weird. Which command did you use for training? It behaves normally for me:
accelerate launch --multi_gpu --num_processes 8 --mixed_precision bf16 ./train_ldm_discrete.py --config=configs/....py
Besides, how many iterations did you train for? Can you see any checkpoints in the workdir?
And were all the checkpoint files saved correctly? Accelerate saves a lot of large files.
Can you provide your output.log?
Yes, my training was unexpectedly interrupted. I ran the same command again to resume; the log reports resuming from 125000.ckpt, but the learning rate starts over from its initial state.
OK. You have successfully resumed the model weights and the optimizer state.
As for the learning rate, I haven't noticed this phenomenon before. Please give me some time to fix this.
The code now supports resuming the lr scheduler, which addresses this problem. Please update your training code with our latest code.
The lr scheduler checkpoint will be correct in subsequent training runs or in a new run.
(We are unable to fix the already-saved checkpoints. However, since we use a constant learning rate after the warm-up, you can temporarily work around the problem by setting the warm-up length to 0 instead of 5000: https://github.com/tyshiwo1/DiM-DiffusionMamba/blob/main/configs/imagenet256_L_DiM.py#L42)
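As a hedged illustration of that workaround (the field name `warmup_steps` is an assumption based on U-ViT-style configs; check line 42 of the linked config for the actual name in your copy):

```python
# Temporary workaround (sketch): disable the warm-up so the resumed run uses the
# constant post-warm-up learning rate immediately. `warmup_steps` is an assumed
# field name; confirm it against configs/imagenet256_L_DiM.py#L42.
config.lr_scheduler.warmup_steps = 0  # previously 5000
```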
This phenomenon occurs because the previous code did not save the lr scheduler. The scheduler's state was never written to the checkpoint, so on resume it starts from its initial state.
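For reference, here is a minimal, hedged sketch of the general mechanism, not the repo's actual training code: once the lr scheduler is passed through Accelerate's `prepare` (or registered via `register_for_checkpointing`), `save_state`/`load_state` round-trip its state. The model, optimizer, and paths below are stand-ins.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Stand-ins for the real model/optimizer; the checkpointing mechanism is what matters.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min((step + 1) / 5000, 1.0)  # linear warm-up, then constant
)

# Preparing the scheduler (or calling
# accelerator.register_for_checkpointing(lr_scheduler)) makes save_state()
# include its state_dict, so load_state() restores the warm-up progress
# instead of restarting the learning rate from step 0.
model, optimizer, lr_scheduler = accelerator.prepare(model, optimizer, lr_scheduler)

accelerator.save_state("workdir/ckpts/125000.ckpt")  # hypothetical checkpoint path
accelerator.load_state("workdir/ckpts/125000.ckpt")  # lr resumes at the saved step
```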
Besides, your loss seems too large. Did you use the exact config I provided? Or maybe you are trying your own modules, in which case there is no problem with my code 😀.
Thank you very much; this problem has been solved.