[training] add feature to save best models

This PR adds the capability to save the best models during training.

A best model is determined by the WER (Word Error Rate) computed on the validation sets specified with the --eval_dataset_name flag. The --save_best_total_limit flag sets the number of best models to save (defaults to 1). Each best model is saved as a checkpoint, so training can be resumed from it. These checkpoints are named in the format checkpoint-2-epoch-0-val-wer-712.132, where 712.132 is the WER.
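The naming and retention behaviour described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the helper names (`checkpoint_name`, `parse_wer`, `rotate_best_checkpoints`) are hypothetical, reading the first number in the directory name as a training step is an assumption, and the three-decimal WER formatting is inferred from the single example name.

```python
import re
import shutil
from pathlib import Path

# Pattern inferred from the example name "checkpoint-2-epoch-0-val-wer-712.132".
CKPT_RE = re.compile(r"^checkpoint-(\d+)-epoch-(\d+)-val-wer-(\d+(?:\.\d+)?)$")


def checkpoint_name(step, epoch, wer):
    """Build a directory name in the format shown above (step is an assumption)."""
    return f"checkpoint-{step}-epoch-{epoch}-val-wer-{wer:.3f}"


def parse_wer(name):
    """Return the WER encoded in a checkpoint directory name, or None if no match."""
    match = CKPT_RE.match(name)
    return float(match.group(3)) if match else None


def rotate_best_checkpoints(output_dir, save_best_total_limit=1):
    """Keep the lowest-WER checkpoints and delete the rest.

    Directories that do not match the naming pattern are left untouched.
    Returns the names of the removed directories, ordered by ascending WER.
    """
    scored = []
    for path in Path(output_dir).iterdir():
        if path.is_dir():
            wer = parse_wer(path.name)
            if wer is not None:
                scored.append((wer, path))
    scored.sort(key=lambda item: item[0])  # lower WER is better

    removed = []
    for _, path in scored[save_best_total_limit:]:
        shutil.rmtree(path)
        removed.append(path.name)
    return removed
```

Encoding the WER in the directory name lets the retention step rank checkpoints without loading any model state, which is one plausible reason for the naming scheme.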

The model saved under the base directory (i.e. the one set with the --output_dir flag) is the best one. The weights of the model at the end of training are saved under an end-of-training-weights directory.

huggingface/distil-whisper #141