unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Is it possible to resume an Unsloth QLoRA fine-tune? If so, how? #591

Closed devzzzero closed 5 months ago

devzzzero commented 5 months ago

Hi, I'm following the instructions in this notebook (apologies for being a TRL newbie): https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing

I've gotten fine-tuning to work, at least for several thousand iterations, until I ran out of memory.

  1. I managed to save the LoRA by doing

      model.save_pretrained("lora_model") # Local saving
      tokenizer.save_pretrained("lora_model")
  2. Now, I am reloading the lora_model ...

      # this is probably not right!
      from peft import AutoPeftModelForCausalLM
      from transformers import AutoTokenizer
      model = AutoPeftModelForCausalLM.from_pretrained(
          "lora_model", # YOUR MODEL YOU USED FOR TRAINING
          load_in_4bit = load_in_4bit,
      )
      tokenizer = AutoTokenizer.from_pretrained("lora_model")
  3. OR

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    # I want to resume training, so not setting it to inference
    # FastLanguageModel.for_inference(model) # Enable native 2x faster inference
    1. Do I need to call `model = FastLanguageModel.get_peft_model(model, ...)` again? (I'm guessing not.)
    2. In traditional pytorch, there is a way to save the current state of the model and reload it later to resume training. Is there a similar facility in unsloth?
    3. Do I need to modify any args to SFTTrainer to resume the LoRA/PEFT training?
    4. In other words, after step 3 (assuming (3) is the right way to load up the LoRA model):

      a. trainer = SFTTrainer(...)  # same args as the first time
      b. trainer_stats = trainer.train()

      That seems like the right approach to me. Is that correct? Are steps (4a) and (4b) the right way to resume? (I'd rather not have to redo the fine-tune from scratch again!)

Thank you.

devzzzero commented 5 months ago

I hope I'm not doing anything wrong, but the resumption seems to have worked (i.e. step (3), followed by (4a) and (4b)).


Step | Training Loss
-- | --
1 | 2.355000
2 | 2.317200
3 | 2.715900
4 | 1.648800
5 | 1.208100
6 | 1.313900
7 | 0.673000

Anyone with corrections/comments please chime in. Thank you!
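
For reference, the flow described above (step (3) followed by (4a)/(4b)) looks roughly like the sketch below; the SFTTrainer arguments are placeholders for whatever was used in the first run, and `dataset` is the same training set as before.

    from unsloth import FastLanguageModel
    from trl import SFTTrainer
    from transformers import TrainingArguments

    max_seq_length = 2048

    # Step (3): reload the saved LoRA adapters. Loading the "lora_model"
    # folder re-attaches the PEFT weights, so get_peft_model() does not
    # need to be called again.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model",
        max_seq_length = max_seq_length,
        dtype = None,
        load_in_4bit = True,
    )

    # Steps (4a)/(4b): rebuild the trainer with the same arguments as the
    # first run and train again. Only the adapter weights are restored this
    # way; the optimizer and learning-rate scheduler start fresh.
    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = dataset,          # same dataset as the first run
        dataset_text_field = "text",
        max_seq_length = max_seq_length,
        args = TrainingArguments(
            per_device_train_batch_size = 2,
            gradient_accumulation_steps = 4,
            learning_rate = 2e-4,
            max_steps = 60,
            output_dir = "outputs",
        ),
    )
    trainer_stats = trainer.train()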

avcode-exe commented 5 months ago

> In traditional pytorch, there is a way to save the current state of the model and reload it later to resume training. Is there a similar facility in unsloth?
>
> Do I need to modify any args to SFTTrainer to resume the LoRA/PEFT training?

For your information, there is a way to save the current state of the training and the model by using checkpoints (this comes from TRL; I don't think Unsloth has its own save-state feature, hope they add one :) ). You need to add the save_steps, output_dir, and resume_from_checkpoint parameters; then in trainer.train(), you point it at the checkpoint folder. In Kaggle, you can use persistent saving. On Colab you can push the checkpoints to HF and pull them back when you want to resume.
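
A rough sketch of that setup (the values here are only examples; in this sketch resume_from_checkpoint is passed to trainer.train() rather than TrainingArguments, and `model`, `tokenizer`, and `dataset` are assumed to come from the usual Unsloth setup cells):

    from trl import SFTTrainer
    from transformers import TrainingArguments

    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = dataset,
        dataset_text_field = "text",
        max_seq_length = max_seq_length,
        args = TrainingArguments(
            per_device_train_batch_size = 2,
            gradient_accumulation_steps = 4,
            learning_rate = 2e-4,
            max_steps = 1000,
            output_dir = "outputs",      # checkpoints land in outputs/checkpoint-<step>
            save_strategy = "steps",
            save_steps = 100,            # write a checkpoint every 100 steps
        ),
    )

    # First run: trains and writes checkpoints along the way.
    trainer.train()

    # Later: rebuild the trainer the same way, then resume from the latest
    # checkpoint in output_dir (or point at a specific checkpoint folder).
    trainer_stats = trainer.train(resume_from_checkpoint = True)
    # trainer_stats = trainer.train(resume_from_checkpoint = "outputs/checkpoint-500")

Unlike just reloading the saved adapters, resuming from a checkpoint also restores the optimizer, scheduler, and step count.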

devzzzero commented 5 months ago

Thank you!! And double doh! It's in the wiki!

https://github.com/unslothai/unsloth/wiki#finetuning-from-your-last-checkpoint

danielhanchen commented 5 months ago

Yep we support training from checkpoints!!