Hi @adhiiisetiawan, thanks for reporting!
So that we can best help you, could you run
transformers-cli env
in the terminal and copy-paste the output?
Regarding the PEFT logic - why are you modifying the state_dict directly like this? You can follow the PEFT docs to see the canonical way to load and prepare a model for training: https://huggingface.co/docs/peft/task_guides/image_classification_lora#train-and-evaluate
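For reference, the flow in that guide looks roughly like the sketch below (the ViT checkpoint, label count and target module names are illustrative; adapt them to your task):

from transformers import AutoModelForImageClassification
from peft import LoraConfig, get_peft_model

# Illustrative base model and label count, following the linked image-classification guide.
model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=10,
    ignore_mismatched_sizes=True,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["query", "value"],  # attention projections of the ViT
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["classifier"],  # keep the fresh classification head trainable
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# The wrapped model then goes straight into Trainer; no manual state_dict
# surgery is needed before or after training.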
cc @muellerzr as we discussed this issue yesterday; seems like safetensors aren't very friendly with the Trainer
@adhiiisetiawan your issue is the call to torch.compile(). If that step is skipped, you can save and load no problem. With it included, you should find that model.state_dict() is completely empty, which leads to this issue.
The sole reason it doesn't error without safetensors is that torch/pickle is happy to load the empty dictionary as well. You can see this by simply adding the following code at the end:
model.save_pretrained("test_model", safe_serialization=False)
f = torch.load("test_model/adapter_model.bin")
print(f)
It should print {}. Remove the .compile() and it will work fine. This is a peft issue specifically with save_pretrained and its behavior with torch.compile. cc @BenjaminBossan
A note on PEFT + torch.compile: Unfortunately, torch.compile still has a couple of gaps that make it not work properly in PEFT. There is not much we can do about it except wait for PyTorch to close those gaps. How that can lead to an empty state_dict, I don't know.
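One general gotcha worth knowing in this area (a sketch under assumptions, not a verified fix for the script in question): torch.compile wraps the module and exposes its parameters under an "_orig_mod." prefix, so saving is usually safer from the unwrapped module than from the compiled wrapper:

import torch

# peft_model is assumed to be the result of get_peft_model(...) as in the issue.
compiled = torch.compile(peft_model)

# The compiled wrapper prefixes every state dict key with "_orig_mod.".
print(list(compiled.state_dict().keys())[:3])

# Save from the unwrapped module (or from the reference held before compiling)
# rather than from the wrapper.
compiled._orig_mod.save_pretrained("adapter_out")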
Oh I see, I got it. Thank you very much, everyone, for your answers and detailed explanations @amyeroberts @LysandreJik @muellerzr @BenjaminBossan
safetensors works now without torch.compile
@adhiiisetiawan Hello~ I want to know whether LoRA training is slowed down without torch.compile, and whether memory consumption increases?
Hi @MerrillLi, in my case I don't have any issues without torch.compile. Sorry for the late response.
I'm having this same issue (details here: https://github.com/huggingface/transformers/issues/28742). Could anyone please help?
I've had exactly the same issue, but I did not use torch.compile. As soon as I call save_pretrained() I get the same problem.
My code is here:
import os
import argparse
from transformers import (
    LlamaForCausalLM,
    LlamaTokenizer,
    LlamaConfig,
    set_seed,
    default_data_collator,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)
from datasets import load_from_disk
import torch
import bitsandbytes as bnb
from huggingface_hub import login, HfFolder
import accelerate
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments
model_id = "psymon/KoLlama2-7b" # sharded weights
tokenizer = LlamaTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = LlamaForCausalLM.from_pretrained(
    model_id,
    use_cache=False,
    device_map="auto",
    quantization_config=bnb_config,
)
def find_all_linear_names(model):
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)
def create_peft_model(model, gradient_checkpointing=True, bf16=True):
    from peft import (
        get_peft_model,
        LoraConfig,
        TaskType,
        prepare_model_for_kbit_training,
    )
    from peft.tuners.lora import LoraLayer
    # prepare int-4 model for training
    model = prepare_model_for_kbit_training(
        model, use_gradient_checkpointing=gradient_checkpointing
    )
    if gradient_checkpointing:
        model.gradient_checkpointing_enable()
    # get lora target modules
    modules = find_all_linear_names(model)
    print(f"Found {len(modules)} modules to quantize: {modules}")
    peft_config = LoraConfig(
        r=64,
        lora_alpha=16,
        target_modules=modules,
        lora_dropout=0.1,
        bias="none",
        task_type=TaskType.CAUSAL_LM,
    )
    model = get_peft_model(model, peft_config)
    # pre-process the model by upcasting the layer norms in float 32 for
    for name, module in model.named_modules():
        if isinstance(module, LoraLayer):
            if bf16:
                module = module.to(torch.bfloat16)
        if "norm" in name:
            module = module.to(torch.float32)
        if "lm_head" in name or "embed_tokens" in name:
            if hasattr(module, "weight"):
                if bf16 and module.weight.dtype == torch.float32:
                    module = module.to(torch.bfloat16)
    model.print_trainable_parameters()
    return model
# create peft config
model = create_peft_model(model, gradient_checkpointing=True, bf16=True)
output_dir = XXXXX
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=8,
    bf16=True,  # Use BF16 if available
    learning_rate=5e-5,
    num_train_epochs=3,
    gradient_checkpointing=True,
    # logging strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="no",
)
# Create a data collator
data_collator = DataCollatorWithPadding(tokenizer)
# Initialize the custom Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
)
# Start training
trainer.train()
model.merge_and_unload()
model.save_pretrained('model_name')
@cam59 Thanks for providing the reproducer. Unfortunately, I cannot reproduce the error. I made some small changes to your script, as you don't provide the data. Also, I used Llama2 7b. Here is the modified script:
import os
import argparse
from transformers import (
    LlamaForCausalLM,
    LlamaTokenizer,
    LlamaConfig,
    set_seed,
    default_data_collator,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset
import torch
import bitsandbytes as bnb
from huggingface_hub import login, HfFolder
import accelerate
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments, DataCollatorForLanguageModeling
# model_id = "psymon/KoLlama2-7b" # sharded weights
model_id = "meta-llama/Llama-2-7b-hf" # BB
tokenizer = LlamaTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = LlamaForCausalLM.from_pretrained(
    model_id,
    use_cache=False,
    device_map="auto",
    quantization_config=bnb_config,
)
def find_all_linear_names(model):
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)
def create_peft_model(model, gradient_checkpointing=True, bf16=True):
    from peft import (
        get_peft_model,
        LoraConfig,
        TaskType,
        prepare_model_for_kbit_training,
    )
    from peft.tuners.lora import LoraLayer
    # prepare int-4 model for training
    model = prepare_model_for_kbit_training(
        model, use_gradient_checkpointing=gradient_checkpointing
    )
    if gradient_checkpointing:
        model.gradient_checkpointing_enable()
    # get lora target modules
    modules = find_all_linear_names(model)
    print(f"Found {len(modules)} modules to quantize: {modules}")
    peft_config = LoraConfig(
        r=64,
        lora_alpha=16,
        target_modules=modules,
        lora_dropout=0.1,
        bias="none",
        task_type=TaskType.CAUSAL_LM,
    )
    model = get_peft_model(model, peft_config)
    # pre-process the model by upcasting the layer norms in float 32 for
    for name, module in model.named_modules():
        if isinstance(module, LoraLayer):
            if bf16:
                module = module.to(torch.bfloat16)
        if "norm" in name:
            module = module.to(torch.float32)
        if "lm_head" in name or "embed_tokens" in name:
            if hasattr(module, "weight"):
                if bf16 and module.weight.dtype == torch.float32:
                    module = module.to(torch.bfloat16)
    model.print_trainable_parameters()
    return model
# create peft config
model = create_peft_model(model, gradient_checkpointing=True, bf16=True)
output_dir = "/tmp/peft/transformers/27397"
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=8,
    bf16=True,  # Use BF16 if available
    learning_rate=5e-5,
    # num_train_epochs=3,
    max_steps=2,  # BB
    gradient_checkpointing=True,
    # logging strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="no",
)
# Create a data collator
# data_collator = DataCollatorWithPadding(tokenizer)
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False) # BB
# BB
data = load_dataset("ybelkada/english_quotes_copy")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
train_dataset = data["train"]
test_dataset = data["train"]
# Initialize the custom Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
)
# Start training
trainer.train()
model.merge_and_unload()
model.save_pretrained(f"{output_dir}/final_model")
Could you check if this passes successfully for you? If yes, any idea what the crucial difference is to your script?
Btw, find_all_linear_names should not be necessary anymore; you can pass target_modules="all-linear" to the LoraConfig.
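In other words, the manual scan can be replaced by something along these lines (sketch, assuming a peft release that supports the "all-linear" shortcut):

from peft import LoraConfig, TaskType

peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules="all-linear",  # let peft target every linear layer except the output head
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)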
I have the same problem, but I don't use torch.compile either. I used SFTTrainer to train the model with LoRA and DeepSpeed ZeRO-3, but when I load the checkpoint saved by the epoch save strategy, I get the error. Here are my TrainingArguments:
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=8,
    bf16=True,  # Use BF16 if available
    learning_rate=5e-5,
    num_train_epochs=3,
    optim="adamw_torch",
    # logging strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="epoch",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
    peft_config=peft_config,
)
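A side note on that snippet: transformers.Trainer does not accept a peft_config argument; that parameter belongs to trl's SFTTrainer, which you mention using. A rough sketch of the SFTTrainer variant, reusing the variables from your script (assumed, not tested here):

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
    peft_config=peft_config,  # SFTTrainer applies the LoRA config to the model itself
)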
System Info
Hi guys, I just fine-tuned Alpaca (a LLaMA 7B base model) with a custom dataset using the Trainer API. After completing the training process, I received the following error:
and here's my code:
I have already hit this error maybe 3 times. Initially, I suspected it might be related to the model I was using (the Alpaca-weight base model), but even after switching to the plain LLaMA 7B base model, the problem persists. I still can't find the root cause or how to solve it. In my opinion, though, the problem comes from the safetensors model itself, because when I try to open the safetensors model using this code, I get the same error.
Note: I installed the transformers library from source. When using the version from PyPI, I didn't encounter an error, because the model was saved in .bin format rather than .safetensors.
Reproduction
To reproduce the behavior:
Expected behavior
Training completes successfully, with the model saved in .safetensors format.
Update
The training process completes using transformers installed from source, but the model is saved as .bin, not .safetensors. That's okay, but I'm still curious why safetensors raises an error when I try to open it. Here's my Colab link from when I tested opening the safetensors model.
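For reference, a typical way to inspect a .safetensors file looks roughly like this (illustrative only, not necessarily the exact code in the Colab; the file path is a placeholder):

from safetensors import safe_open

# Placeholder path; point it at the saved checkpoint file.
with safe_open("adapter_model.safetensors", framework="pt", device="cpu") as f:
    for key in f.keys():
        print(key, tuple(f.get_tensor(key).shape))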