This PR adds the capability to save the best models during training.
A best model is determined by the WER (Word Error Rate) computed on the validation sets given by the `--eval_dataset_name` flag. The `--save_best_total_limit` flag specifies the number of best models to save (defaults to 1). Each best model is saved as a checkpoint, so training can be resumed from it. Saved checkpoints are named in the format `checkpoint-2-epoch-0-val-wer-712.132`, where 712.132 is the WER.
The model saved under the base directory (i.e. the one set with the `--output_dir` flag) is the best one. The weights of the model at the end of training are saved under an `end-of-training-weights` directory.
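
To illustrate the naming scheme and the effect of `--save_best_total_limit`, here is a minimal sketch and not the code in this PR. It assumes the first number in the checkpoint name is a step counter and that the WER is written with three decimals. The helper names (`format_checkpoint_name`, `parse_wer`, `select_checkpoints_to_delete`) are hypothetical and exist only for this example.

```python
import re
from typing import List, Optional

# Assumed layout: checkpoint-<step>-epoch-<epoch>-val-wer-<wer>
CKPT_PATTERN = re.compile(
    r"^checkpoint-(\d+)-epoch-(\d+)-val-wer-(\d+(?:\.\d+)?)$"
)


def format_checkpoint_name(step: int, epoch: int, wer: float) -> str:
    """Build a checkpoint directory name (hypothetical helper)."""
    return f"checkpoint-{step}-epoch-{epoch}-val-wer-{wer:.3f}"


def parse_wer(name: str) -> Optional[float]:
    """Return the WER encoded in a checkpoint name, or None if it does not match."""
    match = CKPT_PATTERN.match(name)
    return float(match.group(3)) if match else None


def select_checkpoints_to_delete(
    names: List[str], save_best_total_limit: int = 1
) -> List[str]:
    """Return the checkpoints that fall outside the N best (lowest WER)."""
    scored = [(parse_wer(n), n) for n in names]
    # Ignore directories that do not follow the naming scheme.
    scored = [(wer, n) for wer, n in scored if wer is not None]
    scored.sort()  # lower WER is better, so the best checkpoints come first
    return [n for _, n in scored[save_best_total_limit:]]
```

With the default limit of 1, only the checkpoint with the lowest validation WER is kept, and every other matching checkpoint is returned as a deletion candidate.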