With model checkpoints being trimmed away, the best-evaluating checkpoint can currently be lost automatically. This PR compares the eval loss against the best seen so far and saves a "best_loss" checkpoint that is overwritten each time a new best is achieved.
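A minimal sketch of the intended behaviour (variable and function names here are illustrative, not the actual identifiers in this PR): track the best eval loss seen so far and overwrite a fixed-tag checkpoint whenever it improves.

```python
import math

best_eval_loss = math.inf

def maybe_save_best(model_engine, eval_loss, save_dir="checkpoints"):
    """Overwrite the 'best_loss' checkpoint if eval_loss improved."""
    global best_eval_loss
    if eval_loss < best_eval_loss:
        best_eval_loss = eval_loss
        # Fixed tag, so each new best overwrites the previous best_loss checkpoint
        model_engine.save_checkpoint(save_dir, tag="best_loss")
```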
I'm not certain I'm capturing the loss for every eval; I believe I am, but it would be great to have this confirmed.
Is there a more elegant way to relay `pending_best_loss_save` than doing a file write and an `os.path.exists` check in the `process_step` loop? My DeepSpeed knowledge is limited.
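For context, a rough illustration of the sentinel-file relay being asked about (filenames and helper names are assumptions, not the PR's actual code): the eval side writes a marker file, and the training loop checks for and consumes it.

```python
import os

SENTINEL = "pending_best_loss_save"  # hypothetical sentinel filename

def signal_best_loss_save(run_dir):
    # Eval side: request that the next process_step saves a best_loss checkpoint
    open(os.path.join(run_dir, SENTINEL), "w").close()

def consume_best_loss_signal(run_dir):
    # Training loop side: return True (and clear the marker) if a save was requested
    path = os.path.join(run_dir, SENTINEL)
    if os.path.exists(path):
        os.remove(path)
        return True
    return False
```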