Hello @imrankh46, it would be great if you could provide a minimal reproducible script.
I'm using the same example, fine-tune-opt-bnb-peft.ipynb.
But in the notebook I just changed the model name to bloom_7b.
Then it gives this error:
RuntimeError: self and mat2 must have the same dtype
This issue is in fact a duplicate of #141, because it is caused by the same piece of code and gives the same error.
It's just that neither of you gave the full information needed for meaningful troubleshooting.
@imrankh46 It can be fixed by applying the following change to peft/tuners/lora.py at line 148 (as shown in #141, with a minor tweak):
bias = target.bias is not None
- if loaded_in_8bit and isinstance(target, bnb.nn.Linear8bitLt) and self.peft_config.enable_lora is None:
+ if loaded_in_8bit and isinstance(target, bnb.nn.Linear8bitLt):
kwargs.update(
This does work, but it is merely a bandage, as it does not fix the underlying problem with the code: there is no way to do MergedLinear in 8-bit in the original code.
So you might encounter errors at inference time as well; at that point you may need to apply similar fixes (until the Hugging Face team fixes it).
This is due to the model architecture of Bloom: query, key and value are computed in the same module, named "query_key_value".
As in the original LoRA paper, one should only need to apply LoRA to query and value, and it will perform on par with applying LoRA to all of query, key, value and output.
Because q, k and v are merged together in Bloom, PEFT uses MergedLinear to separate the q, k and v, then only trains on q and v.
And here comes the problem: MergedLinear does not work with load_in_8bit.
PEFT will simply ignore load_in_8bit and continue to use 32-bit or 16-bit Conv1D and Linear, hence the dtype mismatch.
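A quick way to check which layer class PEFT actually picked is to inspect the wrapped model (a hedged diagnostic sketch; it only assumes the PEFT-wrapped model is stored in a variable named model):
# Hedged diagnostic: print the class and weight dtype of every adapted
# "query_key_value" module. If the 8-bit path was skipped, you will see a
# plain MergedLinear here instead of an 8-bit (Linear8bitLt-based) LoRA layer.
for name, module in model.named_modules():
    if "query_key_value" in name and hasattr(module, "weight"):
        print(name, type(module).__name__, module.weight.dtype)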
This also applies to GPT-NeoX and similar models with merged attention layers, so without these fixes you will not be able to train those models with load_in_8bit set to True.
As mentioned above, ignoring self.peft_config.enable_lora when load_in_8bit is set to True is merely a bandage.
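To make the failure mode concrete, here is a hedged repro sketch of a setup that runs into it (assumptions: a peft release from this era whose default Bloom mapping behaves as described above; exact argument names may differ across transformers/peft versions):
# Hedged repro sketch: Bloom loaded in 8-bit plus the default LoRA setup
# (which routes "query_key_value" through MergedLinear, per the explanation
# above) fails in the forward pass with
# "RuntimeError: self and mat2 must have the same dtype".
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "bigscience/bloom-7b1"  # assumption: any Bloom checkpoint shows the same behaviour
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # base weights become bnb.nn.Linear8bitLt (int8)
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",  # default mapping targets Bloom's "query_key_value"
)
model = get_peft_model(model, lora_config)

batch = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
out = model(**batch, labels=batch["input_ids"])  # raises the dtype error here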
Here is an easy fix to peft/tuners/lora.py that I can think of:
bias = target.bias is not None
- if loaded_in_8bit and isinstance(target, bnb.nn.Linear8bitLt) and self.peft_config.enable_lora is None:
+ if loaded_in_8bit and isinstance(target, bnb.nn.Linear8bitLt):
+     if self.peft_config.enable_lora is not None:
+         warnings.warn(
+             "loaded_in_8bit is set to True but it can't be used with enable_lora. "
+             "Setting enable_lora to None. "
+             "(Don't worry, LoRA is still enabled, just not separately trained.)"
+         )
+         self.peft_config.enable_lora = None
+     if kwargs["fan_in_fan_out"]:
+         warnings.warn(
+             "fan_in_fan_out is set to True but the target module is not a Conv1D. "
+             "Setting fan_in_fan_out to False."
+         )
+         kwargs["fan_in_fan_out"] = False
kwargs.update(
This is not much more than the bandage above, but it gives a meaningful warning to users and lets the training continue.
Another way is to not set enable_lora by default, and instead require the user to pass enable_lora through LoraConfig if they want to separate q, k and v.
Then raise an error when the user tries to use enable_lora and loaded_in_8bit at the same time.
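A hedged sketch of what that guard could look like, reusing the names from the snippet above:
# Hedged sketch of the fail-fast alternative: refuse the unsupported combination
# up front instead of silently changing the config behind the user's back.
if loaded_in_8bit and self.peft_config.enable_lora is not None:
    raise ValueError(
        "enable_lora (MergedLinear) is not supported together with load_in_8bit. "
        "Either load the model without 8-bit quantization or leave enable_lora unset."
    )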
There might be other possible fixes, but with my limited knowledge I can't provide them.
Hello @kuronekosaiko, thank you for the detailed pointers and deep dive. Could you and @imrankh46 try #157 and see if it resolves the issue for the Bloom model? Known caveat: it won't work for GPT-2.
Thanks for explaining.
@pacman100 Thanks, it works like a charm.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Yeah, the issue was solved...
I met the same problem when I used the LoRA module to fine-tune ChatGLM-6B-int4.
I am also having issues with this, trying to train llama-13b-4bit through text-generation-webui.
Training 'llama' model using (q, v) projections
Trainable params: 26,214,400 (1.3496 %), All params: 1,942,410,240 (Model: 1,916,195,840)
2023-07-24 16:31:22 INFO:Log file 'train_dataset_sample.json' created in the 'logs' directory.
wandb: Tracking run with wandb version 0.15.5
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Exception in thread Thread-3 (threaded_run):
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/nuclaer/gitrepos/text-generation-webui/modules/training.py", line 665, in threaded_run
trainer.train()
File "/home/nuclaer/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/home/nuclaer/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/nuclaer/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2654, in training_step
loss = self.compute_loss(model, inputs)
File "/home/nuclaer/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2679, in compute_loss
outputs = model(**inputs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 581, in forward
return model_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 569, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 786, in forward
return self.base_model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/_base.py", line 433, in forward
return self.model(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/nuclaer/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/nuclaer/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 693, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/nuclaer/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/nuclaer/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward
query_states = self.q_proj(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/peft/tuners/lora.py", line 668, in forward
result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: self and mat2 must have the same dtype
2023-07-24 16:31:24 INFO:Training complete, saving...
2023-07-24 16:31:24 INFO:Training complete!
Interestingly, text-generation-webui claims the training is completed. Anyway, it seems that the source of peft/tuners/lora.py has changed quite a bit since the bulk of this conversation, and it's not obvious to me how to fix it. I'm new to these repositories. As far as I can tell, the problem originally mentioned in this thread concerns 8-bit training. But perhaps the fix was never made for 4-bit?
Here's some information about my system and installations:
Output of uname -a:
Linux nuclaer-iridium 5.15.0-76-generic #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Nvidia driver version: 515
Cuda version: 11.7
Graphics cards: a GTX 1070 8GB and an RTX 3060 12GB
Peft version: peft-0.4.0
Commit hash for text-generation-webui: 3ef49397bbbf93cc12ab21d83d9a40a83cf8d68e
I have the monkeypatch installed to allow 4-bit training with AutoGPTQ.
Has anyone gotten 4-bit training to work with this recently? Is there something I'm missing?
Getting the same error with Llama-2-7b-Chat-GPTQ-4bit. Training on Colab and I can't get inference to work either; possibly an error with 4-bit vs 8-bit.
Was facing this error with GPT-2 as well with peft==0.3, but upgrading to 0.4 resolved it (with fan_in_fan_out=True).
Can you send a snippet of your code? I'm using peft==0.4.0, but when I try to set fan_in_fan_out=True I get a warning saying: "fan_in_fan_out is set to True but the target module is torch.nn.Linear. Setting fan_in_fan_out to False."
Here's my code:
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.utils.peft_utils import get_gptq_peft_model
from peft import get_peft_config, get_peft_model, LoraConfig, TaskType
from peft import LoraConfig, get_peft_model, get_peft_model_state_dict, PeftModel, set_peft_model_state_dict
model_name_or_path = "TheBloke/Llama-2-7B-GPTQ"
model_basename = "gptq_model-4bit-128g"
use_triton = False
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=True,
    device="cuda:0",
    use_triton=use_triton,
    quantize_config=None,
)
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
    fan_in_fan_out=True,
)
model = get_peft_model(model, lora_config)
import torch
prompt = '''I think the meaning of life is'''
batch = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
batch = {k: v.cuda() for k, v in batch.items()}
with torch.no_grad():
    with torch.autocast("cuda"):
        print(type(model))
        generated = model.generate(
            inputs=batch["input_ids"],
            do_sample=True,
            use_cache=True,
            repetition_penalty=1.1,
            max_new_tokens=20,
            temperature=0.9,
            top_p=0.95,
            top_k=40,
            return_dict_in_generate=True,
            output_attentions=False,
            output_hidden_states=False,
            output_scores=False,
        )
result_text = tokenizer.decode(generated['sequences'].cpu().tolist()[0])
I think it might be something to do with my target modules if this error is even reproducible.
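One detail worth flagging in the snippet above: get_gptq_peft_model is imported but never called, and the quantized model is wrapped with plain get_peft_model instead. A hedged sketch of routing through the AutoGPTQ helper (assuming an auto_gptq release that ships auto_gptq.utils.peft_utils; parameter names may differ between versions):
# Hedged sketch: let AutoGPTQ build the PEFT wrapper so the LoRA layers are
# created against the GPTQ quantized linears rather than assumed fp16 Linears.
# `model` and `lora_config` refer to the objects defined in the snippet above;
# the keyword names below are assumptions for this auto_gptq version.
model = get_gptq_peft_model(
    model,
    peft_config=lora_config,
    train_mode=False,  # inference-only in this example
)
model.print_trainable_parameters()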
@NNDEV1 Sure! Although I am using Bits&Bytes for quantization.
import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
# BnB (4-bit)
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Base Model
model = AutoModelForCausalLM.from_pretrained('gpt2', quantization_config=bnb_cfg, low_cpu_mem_usage=True)
# Reduce memory usage at the cost of some compute
# model.gradient_checkpointing_enable()
# Enable gradients for the input embeddings (for fine-tuning adapters)
# model.enable_input_require_grads()
# LoRA
config = {
    'r': 16,
    'lora_alpha': 16,
    'lora_dropout': 0.1,
    'bias': 'none',
    'fan_in_fan_out': True,
    'modules_to_save': ['score'],
    'target_modules': ['c_attn', 'c_proj'],
    'task_type': 'CAUSAL_LM',
}
lora = LoraConfig(**config)
model = get_peft_model(model, lora)
# Test: forward()
bs, seq = 2, 10
b = {'input_ids': torch.randint(0, 100, (bs, seq)), 'attention_mask': torch.ones((bs, seq))}
b['labels'] = b['input_ids']
out = model(**b)
Env: transformers==4.31.0, peft==0.4.0, bitsandbytes==0.41.0
Hey, did anyone get this to work for 4-bit GPTQ?
I got this error when I ran the following code:
32 frames
/usr/local/lib/python3.8/dist-packages/peft/tuners/lora.py in forward(self, x)
    446             return F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
    447         else:
--> 448             result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
    449             if self.r > 0:
    450                 after_A = self.lora_A(self.lora_dropout(x))
RuntimeError: self and mat2 must have the same dtype
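For anyone puzzled by the message itself: it is PyTorch's generic matmul dtype check, not anything peft-specific. A hedged, self-contained illustration:
# Hedged illustration: F.linear requires the input and the weight to share a
# dtype, so fp32 activations hitting fp16 (or int8-backed) weights raise a
# dtype-mismatch RuntimeError (exact wording varies by torch version and device).
import torch
import torch.nn.functional as F

x = torch.randn(2, 4)                       # float32 activations
w = torch.randn(3, 4, dtype=torch.float16)  # float16 weight

try:
    F.linear(x, w)
except RuntimeError as err:
    print(err)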