I hope I'm not doing anything wrong, but the resumption seems to have worked (i.e. step (3), followed by (7a) and (7b)).
Step | Training Loss
-- | --
1 | 2.355000
2 | 2.317200
3 | 2.715900
4 | 1.648800
5 | 1.208100
6 | 1.313900
7 | 0.673000
Anyone with corrections/comments please chime in. Thank you!
In traditional PyTorch, there is a way to save the current state of the model and reload it later to resume training. Is there a similar facility in Unsloth?
Do I need to modify any args to SFTTrainer to resume the LoRA/PEFT training?
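(By the traditional PyTorch way I mean roughly this pattern; a minimal sketch with placeholder file names:)

```python
import torch

# Plain-PyTorch checkpointing: save the model and optimizer state dicts ...
torch.save(
    {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    },
    "checkpoint.pt",
)

# ... and load them back later to resume training.
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
step = ckpt["step"]
```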
For your information, there is a way to save the current status of the training and the model by using checkpoints (this comes from TRL; I don't think Unsloth has its own status-saving, hope they add this feature :) ). You need to add the `save_steps`, `output_dir`, and `resume_from_checkpoint` parameters. Then in `trainer.train()` you need to pass the checkpoint folder.
In Kaggle, you can use persistent storage. On Colab you can push to the HF Hub and pull it back when you want to resume.
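Roughly, the setup looks like this (just a sketch; the values are examples and the remaining SFTTrainer arguments should match whatever the notebook already uses):

```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        output_dir="outputs",   # checkpoints are written here
        save_strategy="steps",
        save_steps=50,          # how often to write a checkpoint
        max_steps=1000,
        per_device_train_batch_size=2,
        # ... the rest of the args from the notebook ...
    ),
)

# The first run writes outputs/checkpoint-50, outputs/checkpoint-100, ...
# trainer.train()

# To resume, point train() at a checkpoint (True picks the latest one in
# output_dir; this needs at least one checkpoint to already exist):
trainer.train(resume_from_checkpoint=True)
# trainer.train(resume_from_checkpoint="outputs/checkpoint-50")
```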
Thank you!! And double doh! It's in the wiki!
https://github.com/unslothai/unsloth/wiki#finetuning-from-your-last-checkpoint
Yep we support training from checkpoints!!
Hi, I'm following the instructions in this notebook (apologies for being a TRL newbie): https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing
I've gotten fine-tuning to work, at least for several thousand iterations, until I ran out of memory.
I managed to save the LoRA by doing
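(roughly the standard adapter-saving calls from the notebook; a sketch, with the folder name `lora_model` assumed from the reload step below)

```python
# Saves only the LoRA adapter weights, not the full base model.
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
```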
Now, I am reloading the lora_model ...
OR
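(The two reload variants I'm trying are roughly these; a sketch, assuming `max_seq_length`, `dtype`, and `load_in_4bit` are still defined from earlier in the notebook:)

```python
# Variant 1: point Unsloth directly at the saved adapter folder.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="lora_model",        # the folder saved above
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

# Variant 2: load the base model first, then attach the adapter with PEFT.
# from peft import PeftModel
# model = PeftModel.from_pretrained(base_model, "lora_model")
```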
In other words, after step (3) (assuming (3) is the right method to load up the LoRA model), doing:

a. `trainer = SFTTrainer(...)  # same args as the first time`
b. `trainer_stats = trainer.train()`

seems to be the right approach. Is that correct? Are steps (7a) and (7b) the right way to resume? (I'd rather not have to redo the fine-tune from scratch again!) Thank you.
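For reference, a rough sketch of what I mean by (7a) and (7b), assuming the trainer is rebuilt around the reloaded LoRA model with the same arguments as the first run (`resume_from_checkpoint` is what actually restores the optimizer/step state, if Trainer checkpoints were kept):

```python
# (7a) rebuild the trainer around the reloaded LoRA/PEFT model
trainer = SFTTrainer(
    model=model,            # the reloaded lora_model
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,     # same TrainingArguments as the first run
)

# (7b) continue training; without resume_from_checkpoint this keeps the adapter
# weights but restarts the optimizer and learning-rate schedule from scratch.
trainer_stats = trainer.train()
# trainer_stats = trainer.train(resume_from_checkpoint=True)
```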