unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

LLaMA-3.1-8B finetune: Unsloth does NOT load a pre-finetuned QLoRA adapter correctly, but loads its default instead? #1045

Open thusinh1969 opened 3 weeks ago

thusinh1969 commented 3 weeks ago

Evidence:

We first finetune without Unsloth, using QLoRA with rank 32, targeting all linear layers AND embed/lm_head (with a 10x smaller lr, and the same right-padding token as Unsloth), for a total of 1,134,559,232 trainable parameters. Our finetuning context length is 32768 on an H100 NVL, and the final QLoRA adapter is 2.6 GB.
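
For reference, the first-stage setup described above corresponds roughly to a PEFT config like the following. This is a minimal sketch (the base model id, alpha and dropout are illustrative assumptions, not the actual training script), but rank 32 over all linear projections plus fully-trained embed_tokens/lm_head reproduces the 1,134,559,232 trainable parameters shown in the log below.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch only: 4-bit base model, rank-32 LoRA over all linear projections,
# and embed_tokens / lm_head made fully trainable via modules_to_save.
bnb_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_compute_dtype = torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",          # hypothetical base model id
    quantization_config = bnb_config,
    torch_dtype = torch.bfloat16,
)
lora_config = LoraConfig(
    r = 32,
    lora_alpha = 32,
    lora_dropout = 0.0,
    bias = "none",
    task_type = "CAUSAL_LM",
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    modules_to_save = ["embed_tokens", "lm_head"],   # fully trained, not LoRA
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# -> trainable params: 1,134,559,232 (matches the log below)
```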

### Padding token:  <|finetune_right_pad_id|>
### Padding token ID:  128004
**trainable params: 1,134,559,232 || all params: 9,164,820,480 || trainable%: 12.3795**
Use BNB Adam8bit bnb.optim.Adam8bit
EraX pretrain: Setting lr = 5.00e-06 instead of 5.00e-05 for embed_tokens.
EraX pretrain: Setting lr = 5.00e-06 instead of 5.00e-05 for lm_head.
*** Training starts...
[WARNING|trainer.py:598] 2024-09-21 06:38:41,371 >> max_steps is given, it will override any value given in num_train_epochs
[INFO|trainer.py:648] 2024-09-21 06:38:41,371 >> Using auto half precision backend
<transformers.trainer_callback.DefaultFlowCallback object at 0x7d855c60cb20>
<transformers.integrations.integration_utils.TensorBoardCallback object at 0x7d855c60cb80>
<transformers.trainer_callback.ProgressCallback object at 0x7d855c60cc70>
**trainable params: 1,134,559,232 || all params: 9,164,820,480 || trainable%: 12.3795**
Last checkpoint: None
[INFO|trainer.py:2134] 2024-09-21 06:38:42,047 >> ***** Running training *****
[INFO|trainer.py:2135] 2024-09-21 06:38:42,047 >>   Num examples = 13
[INFO|trainer.py:2136] 2024-09-21 06:38:42,047 >>   Num Epochs = 1
[INFO|trainer.py:2137] 2024-09-21 06:38:42,047 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:2140] 2024-09-21 06:38:42,047 >>   Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:2141] 2024-09-21 06:38:42,047 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:2142] 2024-09-21 06:38:42,047 >>   Total optimization steps = 10
[INFO|trainer.py:2143] 2024-09-21 06:38:42,051 >>   Number of trainable parameters = 1,134,559,232
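
The "Setting lr = 5.00e-06 instead of 5.00e-05" lines above come from giving embed_tokens/lm_head their own optimizer parameter group with a 10x smaller learning rate. A hedged sketch of that pattern (plain bitsandbytes + PyTorch, assuming the `model` object from the previous snippet):

```python
import bitsandbytes as bnb

base_lr = 5e-05
embed_params, other_params = [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    if "embed_tokens" in name or "lm_head" in name:
        embed_params.append(param)      # gets the 10x smaller lr
    else:
        other_params.append(param)

optimizer = bnb.optim.Adam8bit([
    {"params": other_params, "lr": base_lr},        # 5.00e-05
    {"params": embed_params, "lr": base_lr / 10},   # 5.00e-06, as in the log
])
```

The custom optimizer can then be handed to the HF `Trainer` through its `optimizers=(optimizer, None)` argument.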

Unsloth continued finetune: After this finetuning completes successfully, we continue with Unsloth:

import torch
from unsloth import FastLanguageModel

# Resolve the requested dtype ("bfloat16" here) into a torch dtype.
torch_dtype = (
    model_args.torch_dtype                     # e.g. "bfloat16"
    if model_args.torch_dtype in ["auto", None]
    else getattr(torch, model_args.torch_dtype)
)

max_seq_length = data_args.max_seq_length      # 81920
dtype = torch_dtype
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = training_args.pretrained_qlora_path,  # the saved final QLoRA finetuned folder
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

Unsloth shows only 167,772,160 parameters being trained!!! It seems it loads Unsloth's default LoRA settings and not my QLoRA setup & finetuned weights at all?:

### Padding token:  <|finetune_right_pad_id|>
### Padding token ID:  128004
**trainable params: 167,772,160 || all params: 9,248,706,560 || trainable%: 1.8140**
Use BNB Adam8bit bnb.optim.Adam8bit
*** Training starts...
[WARNING|trainer.py:598] 2024-09-21 06:48:55,084 >> max_steps is given, it will override any value given in num_train_epochs
[INFO|trainer.py:648] 2024-09-21 06:48:55,084 >> Using auto half precision backend
<transformers.trainer_callback.DefaultFlowCallback object at 0x7165f8c3fb50>
<transformers.integrations.integration_utils.TensorBoardCallback object at 0x71691c8a60e0>
<transformers.trainer_callback.ProgressCallback object at 0x7165f8c3fa30>
**trainable params: 167,772,160 || all params: 9,248,706,560 || trainable%: 1.8140**
Last checkpoint: None
[WARNING|<string>:213] 2024-09-21 06:48:55,660 >> ==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 13 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 1
\        /    Total batch size = 1 | Total steps = 10
 "-____-"     Number of trainable parameters = **167,772,160**
{'loss': 0.0015, 'grad_norm': 0.06821424514055252, 'learning_rate': 4.877641290737884e-05, 'epoch': 0.08}                                                  
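
For comparison between the two runs, here is a quick, Unsloth-agnostic way to see which modules actually ended up trainable (plain PyTorch; `summarize_trainable` is just a throwaway helper):

```python
from collections import Counter

def summarize_trainable(model):
    # Group trainable parameter counts by name prefix, so a missing
    # embed_tokens / lm_head entry in the second run is obvious at a glance.
    counts = Counter()
    for name, param in model.named_parameters():
        if param.requires_grad:
            counts[".".join(name.split(".")[:4])] += param.numel()
    for prefix, numel in sorted(counts.items()):
        print(f"{prefix:<60} {numel:,}")
    print("total trainable:", f"{sum(counts.values()):,}")

summarize_trainable(model)
```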

Even weirder, Unsloth finally saved the checkpoint and the final adapter adapter_model.safetensors at exactly 2.6 GB in size. Why? What is actually going on?

Any idea why?

Thanks, Steve

danielhanchen commented 2 weeks ago

Oh it's possible the lm_head and embed_tokens were not enabled for training in the 2nd run
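
For reference, Unsloth's continued-pretraining examples make those modules trainable by listing them explicitly in target_modules when building the adapters from scratch; a sketch with illustrative hyperparameters (only the embed_tokens/lm_head entries are the point here):

```python
from unsloth import FastLanguageModel

# Sketch: r / lora_alpha are illustrative; the relevant part is that
# embed_tokens and lm_head must be listed explicitly to become trainable.
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head"],
    lora_alpha = 32,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)
```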

thusinh1969 commented 2 weeks ago

I found it. Same setup; just passing resume_from_checkpoint seems to start the continued finetuning correctly. Will have to check quality later though.
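
For anyone hitting the same thing, the fix above is the standard Trainer resume path; a minimal sketch (the checkpoint path is hypothetical):

```python
# Resume optimizer / scheduler state and adapter weights from the last
# saved checkpoint instead of starting the adapters from scratch.
trainer.train(resume_from_checkpoint = "outputs/checkpoint-10")  # hypothetical path
# or let the Trainer pick the latest checkpoint in output_dir:
# trainer.train(resume_from_checkpoint = True)
```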

Thanks a lot. Steve