microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI
MIT License

[minillm] how to resume training when it occurs some bugs during running file 'train_minillm.py'? #61

Closed SleepEarlyLiveLong closed 1 year ago

SleepEarlyLiveLong commented 1 year ago

I ran `train_minillm.py` successfully following the README.md. However, due to some uncontrollable factors, the GPU job gets interrupted approximately every 6-8 hours. The locally saved files at that point are shown in the attached image. How can I resume the interrupted training? Thanks a lot!

t1101675 commented 1 year ago

We didn't strictly implement resuming training from a checkpoint, because we don't save the optimizer state due to its large storage footprint.

However, you can try an approximate version: first change the CKPT variable in [train_7B_13B.sh](https://github.com/microsoft/LMOps/blob/main/minillm/scripts/llama/minillm/train_7B_13B.sh) to the path of the latest checkpoint. Then manually skip the training steps already completed by adding code like

if self.global_iter_count <= 1000:  # 1000 = global step of the loaded checkpoint
    self.iter_count += 1
    if self.iter_count % self.args.gradient_accumulation_steps == 0:
        self.global_iter_count += 1
        self.scheduler.step()  # keep the LR schedule in sync while skipping
    continue  # skip the forward/backward pass for this batch

to this line.
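For concreteness, here is a minimal, self-contained sketch of how that skip block behaves inside a training loop. The `Trainer`, `DummyScheduler`, and `resume_at` names are hypothetical stand-ins (the real loop lives in minillm's trainer, and the threshold would be the checkpoint's global step, e.g. 1000); the point is that skipped batches still advance both counters and step the scheduler, so the learning-rate schedule stays aligned with the original run:

```python
class DummyScheduler:
    """Stand-in for an LR scheduler; only counts how often it is stepped."""
    def __init__(self):
        self.steps = 0

    def step(self):
        self.steps += 1


class Trainer:
    def __init__(self, resume_at, gradient_accumulation_steps):
        self.resume_at = resume_at            # global step of the loaded checkpoint
        self.grad_acc = gradient_accumulation_steps
        self.iter_count = 0                   # micro-batch counter
        self.global_iter_count = 0            # optimizer-step counter
        self.scheduler = DummyScheduler()
        self.trained_batches = 0

    def train(self, batches):
        for _ in range(batches):
            # --- skip block from the workaround above ---
            if self.global_iter_count <= self.resume_at:
                self.iter_count += 1
                if self.iter_count % self.grad_acc == 0:
                    self.global_iter_count += 1
                    self.scheduler.step()     # LR schedule advances even while skipping
                continue                      # no forward/backward for this batch
            # --- a normal training step would go here ---
            self.trained_batches += 1


trainer = Trainer(resume_at=10, gradient_accumulation_steps=4)
trainer.train(batches=100)
```

With `resume_at=10` and 4 accumulation steps, the first 44 micro-batches are consumed without training (the scheduler is stepped 11 times), and only the remaining 56 batches would actually be trained on.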

Note that this may not guarantee the exact same training dynamics as the previous training because we didn't load the optimizer states.

We may consider adding a training-resume feature in the next version.

donglixp commented 1 year ago


@chentianyangWHU Contributions are welcome.

SleepEarlyLiveLong commented 1 year ago

Thank you!