tdrussell / qlora-pipe

A pipeline parallel training script for LLMs.
MIT License

feature: track and hold onto best eval model #13

Closed: kallewoof closed this 4 months ago

kallewoof commented 5 months ago

Because old model checkpoints are trimmed away, the current setup can silently discard the checkpoint with the best eval loss. This PR compares each eval loss to the best one seen so far and saves a "best_loss" checkpoint, which is overwritten each time a new best is achieved.

  1. I'm not sure I am now getting the loss for every eval. I think I am, but it would be great to have this confirmed.
  2. Is there a more elegant way to relay pending_best_loss_save than writing a file and checking for it with os.path.exists in the process_step loop? My deepspeed knowledge is limited.
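
For illustration, the core idea (compare each eval loss against the best seen and overwrite a single "best_loss" checkpoint) can be sketched as below. This is a minimal sketch and not the code in this PR: the class name `BestLossTracker`, the `update` method, and the directory layout are all assumptions made for the example, and it presumes the regular checkpoint has already been written to disk before `update` is called.

```python
import math
import shutil
from pathlib import Path


class BestLossTracker:
    """Hypothetical helper: keeps one 'best_loss' checkpoint directory,
    overwritten whenever a lower eval loss is observed."""

    def __init__(self, run_dir):
        self.run_dir = Path(run_dir)
        self.best_loss = math.inf  # no eval seen yet

    def update(self, eval_loss, checkpoint_dir):
        """Copy checkpoint_dir to <run_dir>/best_loss if eval_loss is a new best.

        Returns True when a new best checkpoint was saved, else False.
        """
        # 'not a < b' (rather than 'a >= b') also rejects NaN losses.
        if not eval_loss < self.best_loss:
            return False

        self.best_loss = eval_loss
        target = self.run_dir / "best_loss"
        if target.exists():
            shutil.rmtree(target)  # drop the previous best before copying
        shutil.copytree(checkpoint_dir, target)
        return True
```

Because the "best_loss" copy lives outside the rotating set of step checkpoints, trimming old checkpoints no longer affects it. In a multi-process run, presumably only one rank should perform the copy, which this sketch does not handle.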