Closed SleepEarlyLiveLong closed 1 year ago
We didn't strictly implement resuming training from specific checkpoints because we didn't save the optimizer's state, which takes up a large amount of storage.
However, you can try an approximate version by first changing the `CKPT`
variable in [train_7B_13B.sh](https://github.com/microsoft/LMOps/blob/main/minillm/scripts/llama/minillm/train_7B_13B.sh)
to the path of the latest checkpoint. Then, manually skip a certain number of training steps by adding code like
```python
if self.global_iter_count <= 1000:
    self.iter_count += 1
    if self.iter_count % self.args.gradient_accumulation_steps == 0:
        self.global_iter_count += 1
        self.scheduler.step()
    continue
```
to this line.
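To see what the skip logic does, here is a minimal standalone sketch. It mirrors the trainer attributes (`iter_count`, `global_iter_count`, `gradient_accumulation_steps`), while `resume_step` and the function itself are hypothetical names introduced for illustration:

```python
def fast_forward(total_micro_steps, gradient_accumulation_steps, resume_step):
    """Replay the skip loop and count how many scheduler updates it performs.

    `resume_step` plays the role of the hard-coded 1000 above: micro-batches
    are consumed without forward/backward passes until the global step count
    passes it, stepping the scheduler so the learning rate stays in sync.
    """
    iter_count = 0          # counts micro-batches (self.iter_count)
    global_iter_count = 0   # counts optimizer updates (self.global_iter_count)
    scheduler_steps = 0     # stands in for self.scheduler.step() calls
    for _ in range(total_micro_steps):
        if global_iter_count <= resume_step:
            iter_count += 1
            if iter_count % gradient_accumulation_steps == 0:
                global_iter_count += 1
                scheduler_steps += 1
            continue
        break  # normal training would proceed from this micro-batch on
    return global_iter_count, scheduler_steps
```

Note that with the `<=` comparison, the loop skips one extra update past `resume_step` before real training resumes; for example, `fast_forward(100, 4, 2)` returns `(3, 3)`.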
Note that this does not guarantee exactly the same training dynamics as the original run, because the optimizer states are not loaded.
We may consider adding a training-resuming feature in the next version.
@chentianyangWHU Contributions are welcome.
thank you!
I ran `train_minillm.py` successfully following the README.md. However, due to some uncontrollable factors, the GPU is interrupted approximately every 6-8 hours. The locally saved files at that point are shown in the following figure. How should I continue the interrupted training? Thanks a lot!