Hello, how do I resume training? Running the train command again loads the model, but the learning rate starts from the beginning.
That's weird. Which command did you use for training? It behaves normally for me:
accelerate launch --multi_gpu --num_processes 8 --mixed_precision bf16 ./train_ldm_discrete.py --config=configs/....py
Besides, how many iterations did you train for? Can you see any checkpoints in the workdir?
And were all the checkpoint files saved correctly? Accelerate saves a lot of large files.
Can you provide your output.log?
Yes, my training was unexpectedly interrupted. I ran the same command again to resume; the log reports resuming from 125000.ckpt, but the learning rate starts over from its initial state.
OK. You have successfully resumed the model weights and the optimizer state.
As for the learning rate, I haven't noticed this phenomenon before. Please give me some time to fix this.
The code now supports resuming the lr scheduler, which addresses this problem. Please update your training code with our latest code.
The lr scheduler checkpoint will be correct in subsequent training runs or in a new run.
(We are unable to fix the already-saved checkpoints. However, since we use a constant learning rate after the warm-up, you can temporarily work around the problem by setting the warm-up length to 0 instead of 5000: https://github.com/tyshiwo1/DiM-DiffusionMamba/blob/main/configs/imagenet256_L_DiM.py#L42)
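As a hedged illustration of that workaround (the field name `warmup_steps` is an assumption based on U-ViT-style configs; check line 42 of the linked config for the actual name in your copy):

```python
# Temporary workaround (sketch): disable the warm-up so the resumed run uses the
# constant post-warm-up learning rate immediately. `warmup_steps` is an assumed
# field name; confirm it against configs/imagenet256_L_DiM.py#L42.
config.lr_scheduler.warmup_steps = 0  # previously 5000
```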
This phenomenon occurs because the previous code did not save the lr scheduler. The scheduler's state was never written to the checkpoint, so on resume it starts from its initial state.
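For reference, here is a minimal, hedged sketch of the general mechanism, not the repo's actual training code: once the lr scheduler is passed through Accelerate's `prepare` (or registered via `register_for_checkpointing`), `save_state`/`load_state` round-trip its state. The model, optimizer, and paths below are stand-ins.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Stand-ins for the real model/optimizer; the checkpointing mechanism is what matters.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min((step + 1) / 5000, 1.0)  # linear warm-up, then constant
)

# Preparing the scheduler (or calling
# accelerator.register_for_checkpointing(lr_scheduler)) makes save_state()
# include its state_dict, so load_state() restores the warm-up progress
# instead of restarting the learning rate from step 0.
model, optimizer, lr_scheduler = accelerator.prepare(model, optimizer, lr_scheduler)

accelerator.save_state("workdir/ckpts/125000.ckpt")  # hypothetical checkpoint path
accelerator.load_state("workdir/ckpts/125000.ckpt")  # lr resumes at the saved step
```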
Besides, your loss seems too large. Did you use the exact config I provided? Or maybe you are trying your own modules, in which case there is no problem with my code 😀.
Thank you very much; this problem has been solved.