huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft

RuntimeError: self and mat2 must have the same dtype #156

Closed · imrankh46 closed this issue 1 year ago

imrankh46 commented 1 year ago

I got this error when I ran the following code:

import transformers
from datasets import load_dataset

# `model` and `tokenizer` are created earlier in the notebook
# (fine-tune-opt-bnb-peft.ipynb, with the model switched to bloom_7b and
# loaded with load_in_8bit=True, as described further down in this thread).
data = load_dataset('csv', data_files='/content/fyp.csv')
data = data.map(lambda samples: tokenizer(samples['completion']), batched=True)

trainer = transformers.Trainer(
    model=model, 
    train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4, 
        gradient_accumulation_steps=4,
        warmup_steps=100, 
        max_steps=200, 
        learning_rate=2e-4, 
        fp16=True,
        logging_steps=1, 
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

RuntimeError                              Traceback (most recent call last)
<ipython-input> in <module>
     20 )
     21 model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
---> 22 trainer.train()

32 frames
/usr/local/lib/python3.8/dist-packages/peft/tuners/lora.py in forward(self, x)
    446             return F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
    447         else:
--> 448             result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
    449             if self.r > 0:
    450                 after_A = self.lora_A(self.lora_dropout(x))

RuntimeError: self and mat2 must have the same dtype

pacman100 commented 1 year ago

Hello @imrankh46, it would be great if you could provide a minimal reproducible script.

imrankh46 commented 1 year ago

I'm using the same example, fine-tune-opt-bnb-peft.ipynb, but in the notebook I just changed the model name to bloom_7b.

Then it gives this error:

RuntimeError: self and mat2 must have the same dtype

kuronekosaiko commented 1 year ago

TL;DR

This issue is in fact a duplicate of #141: it is caused by the same piece of code and gives the same error.

It's just that neither of you gave enough information for meaningful troubleshooting.

@imrankh46 It can be fixed by applying the following change to peft/tuners/lora.py at line 148 (as shown in #141, with a minor tweak):

                bias = target.bias is not None
    -           if loaded_in_8bit and isinstance(target, bnb.nn.Linear8bitLt) and self.peft_config.enable_lora is None:
    +           if loaded_in_8bit and isinstance(target, bnb.nn.Linear8bitLt):
                    kwargs.update(

This does work, but it is merely a bandage: it does not fix the underlying problem, which is that there is no way to do MergedLinear in 8-bit in the original code.

So you might encounter errors at inference time as well, and you may need to apply similar fixes there (until the Hugging Face team fixes it).

Causes

This is due to BLOOM's model architecture: query, key, and value are computed in a single module named "query_key_value".

As in the original LoRA paper, one should only need to apply LoRA to the query and value projections, and it will perform on par with applying LoRA to all of query, key, value, and output.

Because q, k, and v are merged together in BLOOM, PEFT uses MergedLinear to separate them and then train only on q and v.
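
For illustration, the combination that routes through MergedLinear looks roughly like the sketch below. This assumes a PEFT version from the time this issue was filed, when LoraConfig still exposed an enable_lora field (it has since been removed); the model id and hyperparameters are only examples.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load BLOOM in 8-bit (needs bitsandbytes); "bigscience/bloom-7b1" is just an example id.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-7b1", load_in_8bit=True, device_map="auto"
)

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query_key_value"],  # BLOOM's fused q/k/v projection
    enable_lora=[True, False, True],     # adapt q and v only -> handled by MergedLinear
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Whether enable_lora comes from the user or from PEFT's defaults at the time,
# wrapping an 8-bit model this way is what triggers the dtype error discussed here.
model = get_peft_model(model, config)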

And here comes the problem: MergedLinear does not work with load_in_8bit.

PEFT simply ignores load_in_8bit and continues to use 32-bit or 16-bit Conv1D and Linear layers, hence the dtype mismatch.
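
To see why that produces this particular RuntimeError, here is a minimal, standalone illustration (not the actual PEFT code path): half-precision activations multiplied against a weight that was never cast or quantized.

import torch
import torch.nn.functional as F

# fp16 activations (as produced under fp16=True / autocast) against an fp32 weight
# that was never quantized or cast -- the same kind of mismatch the traceback shows.
x = torch.randn(2, 4, dtype=torch.float16)
w = torch.randn(3, 4, dtype=torch.float32)

try:
    F.linear(x, w)
except RuntimeError as e:
    # A dtype-mismatch error; the exact wording depends on the PyTorch version and device.
    print(e)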

The above also applies to GPT-NeoX and similar models with merged attention layers, so without a fix you will not be able to train those models with load_in_8bit set to True.

Possible Fixes

As mentioned above, ignoring self.peft_config.enable_lora when load_in_8bit is set to True is merely a bandage.

1

Here is an easy fix to peft/tuners/lora.py that I can think of:

                bias = target.bias is not None
    -           if loaded_in_8bit and isinstance(target, bnb.nn.Linear8bitLt) and self.peft_config.enable_lora is None:
    +           if loaded_in_8bit and isinstance(target, bnb.nn.Linear8bitLt):
    +               if self.peft_config.enable_lora is not None:
    +                   warnings.warn(
    +                       "loaded_in_8bit is set to True but it can't be used with enable_lora. "
    +                       "Setting enable_lora to None. "
    +                       "(Don't worry, LoRA is still enabled, just not separately trained.)"
    +                   )
    +                   self.peft_config.enable_lora = None
    +                   if kwargs["fan_in_fan_out"]:
    +                       warnings.warn(
    +                           "fan_in_fan_out is set to True but the target module is not a Conv1D. "
    +                           "Setting fan_in_fan_out to False."
    +                       )
    +                       kwargs["fan_in_fan_out"] = False
                    kwargs.update(

This is not much more than the bandage, but it gives users a meaningful warning and lets training continue.

2

Another way is to not set enable_lora by default and instead require the user to pass enable_lora through LoraConfig if they want to separate q, k, and v.

Then raise an error when the user tries to use enable_lora and loaded_in_8bit at the same time.
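
A hypothetical sketch of such a guard, using the same variable names as the diff above (this is not actual PEFT source):

if loaded_in_8bit and self.peft_config.enable_lora is not None:
    raise ValueError(
        "enable_lora (MergedLinear) is not supported together with load_in_8bit. "
        "Either remove enable_lora from LoraConfig or load the model without 8-bit quantization."
    )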

3

There might be other possible fixes, but with my limited knowledge I can't provide them.

pacman100 commented 1 year ago

Hello @kuronekosaiko, thank you for the detailed pointers and deep dive. Could you and @imrankh46 try #157 and see if that resolves the issue for the BLOOM model? Known caveat: it won't work for GPT-2.

imrankh46 commented 1 year ago

Thanks for explaining.

kuronekosaiko commented 1 year ago

@pacman100 Thanks, it works like a charm.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

imrankh46 commented 1 year ago

> Hello @kuronekosaiko, thank you for the detailed pointers and deep dive. Could you and @imrankh46 try #157 and see if that resolves the issue for the BLOOM model? Known caveat: it won't work for GPT-2.

Yeah, the issue was solved...

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

shanggangli commented 1 year ago

I met the same problem when I used LoRA to fine-tune ChatGLM-6B-int4.

NuclearManD commented 1 year ago

I am also having issues with this, trying to train llama-13b-4bit through text-generation-webui.

Training 'llama' model using (q, v) projections
Trainable params: 26,214,400 (1.3496 %), All params: 1,942,410,240 (Model: 1,916,195,840)
2023-07-24 16:31:22 INFO:Log file 'train_dataset_sample.json' created in the 'logs' directory.
wandb: Tracking run with wandb version 0.15.5
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Exception in thread Thread-3 (threaded_run):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/nuclaer/gitrepos/text-generation-webui/modules/training.py", line 665, in threaded_run
    trainer.train()
  File "/home/nuclaer/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/nuclaer/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/nuclaer/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/nuclaer/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 786, in forward
    return self.base_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/_base.py", line 433, in forward
    return self.model(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nuclaer/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nuclaer/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 693, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/nuclaer/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nuclaer/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward
    query_states = self.q_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/peft/tuners/lora.py", line 668, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: self and mat2 must have the same dtype
2023-07-24 16:31:24 INFO:Training complete, saving...
2023-07-24 16:31:24 INFO:Training complete!

Interestingly, text-generation-webui claims the training completed. Anyway, it seems that the source of peft/tuners/lora.py has changed quite a bit since the bulk of this conversation, and it's not obvious to me how to fix it. I'm new to these repositories. As far as I can tell, the problem originally mentioned in this thread concerns 8-bit training. But perhaps the fix was never made for 4-bit?
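
If it helps to narrow this down, one quick diagnostic (a sketch, assuming `model` is the PEFT-wrapped model from the traceback) is to print the dtypes of everything sitting inside a failing q_proj module:

# Inspect one q_proj module: LoRA adapter weights show up as parameters, while GPTQ's
# packed weights are usually registered as integer buffers (qweight, qzeros, ...).
for name, module in model.named_modules():
    if name.endswith("q_proj"):
        for pname, p in module.named_parameters():
            print(f"{name}.{pname}: {p.dtype}")
        for bname, b in module.named_buffers():
            print(f"{name}.{bname}: {b.dtype}")
        break  # one layer is enough to spot the mismatch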

Here's some information about my system and installations:

- Output of `uname -a`: Linux nuclaer-iridium 5.15.0-76-generic #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- Nvidia driver version: 515
- CUDA version: 11.7
- Graphics cards: a GTX 1070 8GB and an RTX 3060 12GB
- PEFT version: peft-0.4.0
- Commit hash for text-generation-webui: 3ef49397bbbf93cc12ab21d83d9a40a83cf8d68e
- I have a monkeypatch installed to allow 4-bit training with AutoGPTQ.

Has anyone gotten 4-bit training to work with this recently? Is there something I'm missing?

NNDEV1 commented 1 year ago

Getting the same error with Llama-2-7b-Chat-GPTQ-4bit. I'm training on Colab and can't get inference to work either; possibly an error related to 4-bit vs 8-bit.

Axe-- commented 1 year ago

I was facing this error with GPT-2 as well, with peft==0.3, but upgrading to 0.4 resolved it (with fan_in_fan_out=True).

NNDEV1 commented 1 year ago

Can you send a snippet of your code? I'm using peft==0.4.0, but when I try to set fan_in_fan_out=True I get a warning saying: "fan_in_fan_out is set to True but the target module is torch.nn.Linear. Setting fan_in_fan_out to False." Here's my code:

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.utils.peft_utils import get_gptq_peft_model
from peft import (LoraConfig, TaskType, PeftModel, get_peft_config, get_peft_model,
                  get_peft_model_state_dict, set_peft_model_state_dict)

model_name_or_path = "TheBloke/Llama-2-7B-GPTQ"
model_basename = "gptq_model-4bit-128g"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
    fan_in_fan_out=True
)

model = get_peft_model(model, lora_config)

import torch

prompt = '''I think the meaning of life is'''
batch = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
batch = {k: v.cuda() for k, v in batch.items()}

with torch.no_grad():

    with torch.autocast("cuda"):
        print(type(model))
        generated = model.generate(inputs=batch["input_ids"],
                                do_sample=True, use_cache=True,
                                repetition_penalty=1.1,
                                max_new_tokens=20,
                                temperature=0.9,
                                top_p=0.95,
                                top_k=40,
                                return_dict_in_generate=True,
                                output_attentions=False,
                                output_hidden_states=False,
                                output_scores=False)

result_text = tokenizer.decode(generated['sequences'].cpu().tolist()[0])

I think it might be something to do with my target modules if this error is even reproducible.
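
If it is the target modules, one way to double-check the names (a sketch, assuming `model` is the object returned by from_quantized above) is to list the distinct leaf-module names and confirm q_proj / v_proj appear:

# Collect the distinct leaf-module names in the quantized model.
leaf_names = set()
for name, module in model.named_modules():
    if name and len(list(module.children())) == 0:
        leaf_names.add(name.split(".")[-1])
print(sorted(leaf_names))  # q_proj / v_proj should be in this list for LLaMA-style models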

Axe-- commented 1 year ago

@NNDEV1 Sure! Although I am using Bits&Bytes for quantization.

import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# BnB (4-bit)
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Base Model
model = AutoModelForCausalLM.from_pretrained('gpt2', quantization_config=bnb_cfg, low_cpu_mem_usage=True)

# Reduce memory usage at the cost of some compute
# model.gradient_checkpointing_enable()

# Enable gradients for the input embeddings (for fine-tuning adapters)
# model.enable_input_require_grads()

# LoRA
config = {
    'r': 16,
    'lora_alpha': 16,
    'lora_dropout': 0.1,
    'bias': 'none',
    'fan_in_fan_out': True,
    'modules_to_save': ['score'],
    'target_modules': ['c_attn', 'c_proj'],
    'task_type': 'CAUSAL_LM'
}

lora = LoraConfig(**config)

model = get_peft_model(model, lora)

# Test: forward()
bs, seq = 2, 10
b = {'input_ids': torch.randint(0, 100, (bs, seq)), 'attention_mask': torch.ones((bs, seq))}
b['labels'] = b['input_ids']

out = model(**b)   

Env: transformers==4.31.0, peft==0.4.0, bitsandbytes==0.41.0

amanmamgain9 commented 12 months ago

Hey, did anyone get this to work for 4-bit GPTQ?