unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Lora downcasting issue #320

Open kiddyboots216 opened 4 months ago

kiddyboots216 commented 4 months ago

When creating a PEFT model and then trying to train it, we get the following error:

  File "/scratch/gpfs/ashwinee/unsloth/unsloth/kernels/fast_lora.py", line 106, in backward                  
    d_downA = h.t() @ (dY @ downB.t())
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != float

I suspect this is what the recent LoRA downcasting fix PR was addressing. However, I'm still getting an error because dY is bfloat16 while downB is float32 (which is what it was coerced to in prepare_for_kbit_training).
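
For context, the same mismatch can be reproduced outside Unsloth with a plain matmul between a bfloat16 tensor and a float32 tensor (a minimal sketch with made-up shapes):

import torch

# Made-up shapes, only meant to trigger the same dtype check as dY @ downB.t()
dY = torch.randn(4, 8, dtype=torch.bfloat16)
downB = torch.randn(16, 8, dtype=torch.float32)

temp = dY @ downB.t()  # raises a RuntimeError about mismatched dtypes (exact message varies by device)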

danielhanchen commented 4 months ago

@kiddyboots216 Are you using bf16 = True or fp16 = True in the Trainer?
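
For reference, these map to the fp16 / bf16 arguments on transformers.TrainingArguments (the same flags used in the full example further down); a minimal sketch:

import torch
from transformers import TrainingArguments

# Enable exactly one mixed-precision mode, based on what the GPU supports
args = TrainingArguments(
    output_dir = "outputs",
    bf16 = torch.cuda.is_bf16_supported(),
    fp16 = not torch.cuda.is_bf16_supported(),
)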

kiddyboots216 commented 4 months ago

import torch
from unsloth import FastMistralModel

model, tokenizer = FastMistralModel.from_pretrained(
    args.model_path,
    max_seq_length=512,
    dtype=torch.bfloat16,
    load_in_4bit=False,
    attn_implementation="flash_attention_2",
    device_map='auto',
    use_cache=False,
)

model = FastMistralModel.get_peft_model(
  model,
  r = 8,
  target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj",],
  lora_alpha = 16,
  lora_dropout = 0, # Dropout = 0 is currently optimized
  bias = "none",    # Bias = "none" is currently optimized
  use_gradient_checkpointing = False,
  random_state = 3407,
  max_seq_length=512,
  use_rslora=False,
  loftq_config=None
)
danielhanchen commented 4 months ago

@kiddyboots216 Oh wait, use FastLanguageModel. Also, you can copy-paste our Colab notebook if that works: https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing

danielhanchen commented 4 months ago

For a full example:

from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!
# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

# Load Llama model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit", # Supports Llama, Mistral - replace this!
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()
kiddyboots216 commented 4 months ago

Thanks. I was using FastLanguageModel initially and only switched to FastMistralModel for debugging. The only part of SFTTrainer that actually seems to be running (since the error is on the first backward) is calling backward on the logit loss.

If the error isn't reproducible on your end, I can write up a minimal working example. I figured it's something you're aware of, since I saw the closed PR adding support for LoRA downcasting.

danielhanchen commented 4 months ago

Oh, that's actually upcasting!! So A and B were incorrectly in float16, causing incorrect training runs.

kiddyboots216 commented 4 months ago

Gotcha. If we look at

temp = (dY @ downB.t())

then the error indicates that downB is float32 (which is correct) but dY is bfloat16. Should dY be float32 as well?

danielhanchen commented 4 months ago

@kiddyboots216 Oh no, what we're doing is correct. It seems like you're not using mixed precision for training (fp16 = True or bf16 = True in the Trainer).
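
For illustration, a minimal sketch (assuming a CUDA device) of why mixed precision resolves the mismatch: inside an autocast region the float32 operand is cast down automatically, so the bfloat16 x float32 matmul goes through.

import torch

dY = torch.randn(4, 8, dtype=torch.bfloat16, device="cuda")
downB = torch.randn(16, 8, dtype=torch.float32, device="cuda")

with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    temp = dY @ downB.t()  # works: autocast casts downB to bfloat16 for the matmul
print(temp.dtype)          # torch.bfloat16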

kiddyboots216 commented 4 months ago

Sorry, typo - I meant "dY is a bfloat16" (from the original error message).

kiddyboots216 commented 4 months ago

So @danielhanchen, making sure I understand this correctly:

kiddyboots216 commented 4 months ago

I'm currently getting this error with peft==0.10.0 and unsloth installed from source (git clone, pip install -e .).

Here's the stack trace:

==((====))==  Unsloth: Fast Mistral patching release 2024.4                                                             
   \\   /|    GPU: NVIDIA A100 80GB PCIe. Max memory: 79.318 GB. Platform = Linux.                                      
O^O/ \_/ \    Pytorch: 2.2.2. CUDA = 8.0. CUDA Toolkit = 12.1.                                                          
\        /    Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = True.                                                      
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:13<00:00,  6.86s/it]
Unsloth 2024.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
File "/scratch/gpfs/ashwinee/alignment-durability/rlaif/filter_dataset_2.py", line 463, in save_gradient_norms       
    loss.mean().backward()
  File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward      
    torch.autograd.backward(
  File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass                     
  File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 319, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)                                                         
  File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass                     
  File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 142, in
decorate_bwd
    return bwd(*args, **kwargs)
  File "/scratch/gpfs/ashwinee/unsloth/unsloth/kernels/fast_lora.py", line 106, in backward                            
    d_downA = h.t() @ (dY @ downB.t())
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != float

and the full code:

import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    args.model_path,
    max_seq_length=512,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 8,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Dropout = 0 is currently optimized
    bias = "none",    # Bias = "none" is currently optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length=512,
    use_rslora=False,
    loftq_config=None,
)
loss = -model(batch).logits.to(torch.float32)
loss.mean().backward()

danielhanchen commented 4 months ago

@kiddyboots216

For training, dY is in bfloat16. LoRA A and B must be in float32. This is for mixed precision training.

The code you provided will not run at all, because you are upcasting the loss to torch.float32 and not doing mixed-precision training. Wrap your code with:

with torch.cuda.amp.autocast(dtype = torch.bfloat16):
    loss = -model(batch).logits.to(torch.float32)
    loss.mean().backward()
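
As a sanity check, the adapter dtypes can be inspected directly after get_peft_model (parameter names below follow PEFT's lora_A / lora_B naming convention):

# The LoRA A and B matrices should report torch.float32; the frozen base weights keep their loaded dtype
for name, param in model.named_parameters():
    if "lora_A" in name or "lora_B" in name:
        print(name, param.dtype, param.requires_grad)
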
world2vec commented 3 months ago

For my training code (I did not use the Hugging Face Trainer):

import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained("xxxx", dtype=getattr(torch, 'bfloat16'),
                                                     max_seq_length=768, load_in_4bit=True)
model = FastLanguageModel.get_peft_model(model, r=64, lora_alpha=16, lora_dropout=0, bias="none",
                                         random_state=32,
                                         target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
                                         use_dora=False)

If I set lora_dropout to 0.05 without AMP, the training code works well. If I set lora_dropout to 0 without AMP, it errors out with:

 d_downA = h.t() @ (dY @ downB.t())
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != float

If I set lora_dropout to 0 and use torch.cuda.amp.autocast, it errors out with:

 "_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'
BrunoBSM commented 3 weeks ago

I am also facing the same issues. I tried what @world2vec posted and can confirm: with dropout the training runs, with lora_dropout=0 it does not. Tested with float16, bfloat16, and float32. Also, this issue appears to be specific to the Volta architecture: on a V100 I cannot make training work with lora_dropout=0 (even with a proper installation inside a container), whereas on an RTX 3090 it runs with no issues.

danielhanchen commented 3 weeks ago

@BrunoBSM Wait so does normal Unsloth work on V100s? T4s work for now.

@world2vec Apologies for the delay - this got lost! When dropout = 0, Unsloth calls the optimized fast paths. It seems like the autocasting isn't propagating correctly, which is weird - if this is a custom PyTorch trainer, presumably the autocast call wasn't used correctly somewhere, but I'm unsure, sorry.

world2vec commented 3 weeks ago

@danielhanchen My case is on an RTX 4090. Torch AMP with float16 works well; it does not work with bfloat16.

BrunoBSM commented 3 weeks ago

@danielhanchen I am not sure what you mean by normal Unsloth, but I have not been able to make it work on the V100 with lora_dropout = 0. If I set the dropout to anything > 0, I get the warning about a performance drop, but training does run.

danielhanchen commented 3 weeks ago

Ye, so dropout = 0 is optimized, but anything else is not - it still runs correctly.

@world2vec Sadly I'm unsure why your RTX 4090 isn't working, sorry :(