unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Is it possible to resume an Unsloth QLoRA fine-tune? If so, how? #591

Closed devzzzero closed 5 months ago

devzzzero commented 5 months ago

Hi, I'm following the instructions in this notebook (apologies for being a TRL newbie): https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing

I've gotten fine-tuning to work, at least for several thousand iterations, until I ran out of memory.

  1. I managed to save the LoRA by doing

      model.save_pretrained("lora_model") # Local saving
      tokenizer.save_pretrained("lora_model")
  2. Now, I am reloading the lora_model ...

      # this is probably not right!
      from peft import AutoPeftModelForCausalLM
      from transformers import AutoTokenizer
      model = AutoPeftModelForCausalLM.from_pretrained(
          "lora_model", # YOUR MODEL YOU USED FOR TRAINING
          load_in_4bit = load_in_4bit,
      )
      tokenizer = AutoTokenizer.from_pretrained("lora_model")
  3. OR

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    # I want to resume training, so not setting it to inference
    # FastLanguageModel.for_inference(model) # Enable native 2x faster inference
    1. Do I need to call `model = FastLanguageModel.get_peft_model(model, ...)` again? (I'm guessing not.)
    2. In traditional pytorch, there is a way to save the current state of the model and reload it later to resume training. Is there a similar facility in unsloth?
    3. Do I need to modify any args to SFTTrainer to resume the LoRA/PEFT training?
    4. In other words, after step 3 (assuming (3) is the right way to load up the LoRA model):

      a. trainer = SFTTrainer(...)  # same args as the first time
      b. trainer_stats = trainer.train()

      That seems like the right approach to me. Is that correct? Are steps (4a) and (4b) the right way to resume? (I'd rather not have to redo the fine-tune from scratch again!)

Thank you.

devzzzero commented 5 months ago

I hope I'm not doing anything wrong, but the resumption seems to have worked (i.e. step (3), followed by (4a) and (4b)).


Step | Training Loss
-- | --
1 | 2.355000
2 | 2.317200
3 | 2.715900
4 | 1.648800
5 | 1.208100
6 | 1.313900
7 | 0.673000

Anyone with corrections/comments please chime in. Thank you!
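
For reference, the flow described above (step (3) followed by (4a)/(4b)) looks roughly like the sketch below; the SFTTrainer arguments are placeholders for whatever was used in the first run, and `dataset` is the same training set as before.

    from unsloth import FastLanguageModel
    from trl import SFTTrainer
    from transformers import TrainingArguments

    max_seq_length = 2048

    # Step (3): reload the saved LoRA adapters. Loading the "lora_model"
    # folder re-attaches the PEFT weights, so get_peft_model() does not
    # need to be called again.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model",
        max_seq_length = max_seq_length,
        dtype = None,
        load_in_4bit = True,
    )

    # Steps (4a)/(4b): rebuild the trainer with the same arguments as the
    # first run and train again. Only the adapter weights are restored this
    # way; the optimizer and learning-rate scheduler start fresh.
    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = dataset,          # same dataset as the first run
        dataset_text_field = "text",
        max_seq_length = max_seq_length,
        args = TrainingArguments(
            per_device_train_batch_size = 2,
            gradient_accumulation_steps = 4,
            learning_rate = 2e-4,
            max_steps = 60,
            output_dir = "outputs",
        ),
    )
    trainer_stats = trainer.train()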

avcode-exe commented 5 months ago

> In traditional pytorch, there is a way to save the current state of the model and reload it later to resume training. Is there a similar facility in unsloth?
>
> Do I need to modify any args to SFTTrainer to resume the LoRA/PEFT training?

For your information, there is a way to save the current state of the training and the model by using checkpoints (this comes from TRL; I don't think Unsloth has its own save-state feature, hope they add one :) ). You need to add the save_steps, output_dir, and resume_from_checkpoint parameters; then in trainer.train(), you point it at the checkpoint folder. In Kaggle, you can use persistent saving. On Colab you can push the checkpoints to HF and pull them back when you want to resume.
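
A rough sketch of that setup (the values here are only examples; in this sketch resume_from_checkpoint is passed to trainer.train() rather than TrainingArguments, and `model`, `tokenizer`, and `dataset` are assumed to come from the usual Unsloth setup cells):

    from trl import SFTTrainer
    from transformers import TrainingArguments

    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = dataset,
        dataset_text_field = "text",
        max_seq_length = max_seq_length,
        args = TrainingArguments(
            per_device_train_batch_size = 2,
            gradient_accumulation_steps = 4,
            learning_rate = 2e-4,
            max_steps = 1000,
            output_dir = "outputs",      # checkpoints land in outputs/checkpoint-<step>
            save_strategy = "steps",
            save_steps = 100,            # write a checkpoint every 100 steps
        ),
    )

    # First run: trains and writes checkpoints along the way.
    trainer.train()

    # Later: rebuild the trainer the same way, then resume from the latest
    # checkpoint in output_dir (or point at a specific checkpoint folder).
    trainer_stats = trainer.train(resume_from_checkpoint = True)
    # trainer_stats = trainer.train(resume_from_checkpoint = "outputs/checkpoint-500")

Unlike just reloading the saved adapters, resuming from a checkpoint also restores the optimizer, scheduler, and step count.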

devzzzero commented 5 months ago

Thank you!! And double doh! It's in the wiki!

https://github.com/unslothai/unsloth/wiki#finetuning-from-your-last-checkpoint

danielhanchen commented 5 months ago

Yep we support training from checkpoints!!