Open MNCTTY opened 5 years ago
Given that it takes a very long time to train the model, it is essential to be able to checkpoint. Is checkpointing supported? How can I make sure a long training run is checkpointed? Why is it not documented?
Here is what I see in the code:
In train.py:

```python
parser.add_argument('--checkpoint', dest='checkpoint', default=0, type=int,
                    help='Enables checkpoint saving of model')
```
In solver.py:

```python
if self.checkpoint:
    file_path = os.path.join(
        self.save_folder, 'epoch%d.pth.tar' % (epoch + 1))
    torch.save(self.model.serialize(self.model, self.optimizer, epoch + 1,
                                    tr_loss=self.tr_loss,
                                    cv_loss=self.cv_loss),
               file_path)
    print('Saving checkpoint model to %s' % file_path)
```
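So per the flag above, passing `--checkpoint 1` to train.py (it defaults to 0) should enable the per-epoch saving in solver.py. For resuming, here is a minimal sketch of loading one of those `epochN.pth.tar` files. The key names (`state_dict`, `optim_dict`, `epoch`) and the `build_model()` constructor are assumptions, not the repo's confirmed API; check what `model.serialize()` actually packs before relying on this:

```python
import torch

# Hypothetical resume sketch: build_model() is a placeholder for however
# train.py constructs the network. It must use the SAME hyperparameters
# as the checkpointed run, or load_state_dict() raises a size-mismatch error.
model = build_model()
optimizer = torch.optim.Adam(model.parameters())

# Assumed key names ('state_dict', 'optim_dict', 'epoch'); verify them
# against what model.serialize() packs in this repo.
package = torch.load('exp/epoch100.pth.tar', map_location='cpu')
model.load_state_dict(package['state_dict'])
optimizer.load_state_dict(package['optim_dict'])
start_epoch = package['epoch']  # continue the training loop from here
```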
KALDI
Hi,
I tried to train the model from a previous checkpoint. For example, I trained the model for 100 epochs and got the final.pth.tar file. I put the absolute path to it in run.sh in these lines:
but training exits with this log:

Which object could cause this tensor size problem? Am I resuming training from the checkpoint correctly?
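One way to narrow down a size mismatch like this is to compare parameter shapes between the checkpoint and a freshly built model. A quick hedged sketch, again assuming the `state_dict` key and a hypothetical `build_model()` constructor:

```python
import torch

# Hypothetical diagnostic: list every parameter whose shape differs between
# the checkpoint and a freshly constructed model. build_model() is again a
# placeholder for the constructor train.py uses.
model = build_model()
package = torch.load('final.pth.tar', map_location='cpu')
ckpt_state = package['state_dict']  # assumed key; see model.serialize()

for name, param in model.state_dict().items():
    saved = ckpt_state.get(name)
    if saved is None:
        print('missing in checkpoint:', name)
    elif saved.shape != param.shape:
        print(name, tuple(saved.shape), '!=', tuple(param.shape))
```

A non-empty listing would usually mean run.sh now builds the model with different hyperparameters than the run that produced final.pth.tar.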