luisan06 closed this issue 1 year ago.
We can't fix something we can't reproduce. What is the model_path?
Yeah, sorry, it's a local path to the converted LLaMA weights. I also tested with decapoda-research/llama-7b-hf.
Forgot to mention that I had to modify the script "convert_llama_weights_to_hf.py" with the following line:
print("Saving in the Transformers format.")
model.save_pretrained(model_path, max_shard_size="425MB")
Because the system crashes when loading huge shards.
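For reference, the same re-sharding can also be done outside the conversion script; a minimal sketch (the output directory name is a placeholder):

import torch
from transformers import LlamaForCausalLM

# load the already-converted checkpoint without building it twice in host RAM
model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
# write it back out in small shards so low-memory machines can load it
model.save_pretrained("llama-7b-resharded", max_shard_size="425MB")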
Can you try huggyllama/llama-7b? The decapoda checkpoints are not compatible with Transformers. Also make sure to use the latest release or the main version.
I just tried it and got the same error using huggyllama/llama-7b. I'm using the transformers and accelerate libraries from source, everything at the latest version.
Can you load the model with the LlamaForSequenceClassification class?
Oh, I didn't realize you were using this class, sorry; I was trying with LlamaForCausalLM. There is indeed a bug there, will fix ASAP.
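For reference, the failure can be reproduced with a minimal load along these lines (a sketch, mirroring the 8-bit multi-GPU setup used later in this thread):

from transformers import LlamaForSequenceClassification

# loading the sequence-classification head with accelerate's device_map is what currently fails
model = LlamaForSequenceClassification.from_pretrained(
    "huggyllama/llama-7b",
    load_in_8bit=True,
    device_map="auto",
)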
Oh ok, I will wait then! Thanks for the help :+1:
On my side, it's fixed with the PR above.
Yeah, now everything seems to work. I just started a training run with device_map="auto" and the model loads correctly on multiple GPUs. :tada:
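For what it's worth, the placement can be double-checked via the device map that accelerate attaches to the model:

print(model.hf_device_map)  # shows which GPU each submodule was dispatched to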
What was the problem? I don't know much about meta devices.
After a minimal training test, I now get the following error:
RuntimeError: Loading a quantized checkpoint into non-quantized Linear8bitLt is not supported. Please call module.cuda() before module.load_state_dict()
It's a huge trace to copy. I'm going to see if I missed any casting.
cc @younesbelkada
This is the trace; it only fails when I set the trainer with an eval and save strategy and set load_best_model_at_end=True.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[30], line 1
----> 1 trainer.train()
File ~/LLaMA/transformers/src/transformers/trainer.py:1661, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1656 self.model_wrapped = self.model
1658 inner_training_loop = find_executable_batch_size(
1659 self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
1660 )
-> 1661 return inner_training_loop(
1662 args=args,
1663 resume_from_checkpoint=resume_from_checkpoint,
1664 trial=trial,
1665 ignore_keys_for_eval=ignore_keys_for_eval,
1666 )
File ~/LLaMA/transformers/src/transformers/trainer.py:2048, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
2045 elif is_sagemaker_mp_enabled():
2046 smp.barrier()
-> 2048 self._load_best_model()
2050 # add remaining tr_loss
2051 self._total_loss_scalar += tr_loss.item()
File ~/LLaMA/transformers/src/transformers/trainer.py:2224, in Trainer._load_best_model(self)
2219 state_dict = torch.load(best_model_path, map_location="cpu")
2221 # If the model is on the GPU, it still works!
2222 # workaround for FSDP bug https://github.com/pytorch/pytorch/issues/82963
2223 # which takes *args instead of **kwargs
-> 2224 load_result = model.load_state_dict(state_dict, False)
2225 if not is_sagemaker_mp_enabled():
2226 self._issue_warnings_after_load(load_result)
File ~/.virtualenvs/llama-demo/lib/python3.11/site-packages/torch/nn/modules/module.py:2027, in Module.load_state_dict(self, state_dict, strict)
2020 out = hook(module, incompatible_keys)
2021 assert out is None, (
2022 "Hooks registered with ``register_load_state_dict_post_hook`` are not"
2023 "expected to return new values, if incompatible_keys need to be modified,"
2024 "it should be done inplace."
2025 )
-> 2027 load(self, state_dict)
2028 del load
2030 if strict:
File ~/.virtualenvs/llama-demo/lib/python3.11/site-packages/torch/nn/modules/module.py:2015, in Module.load_state_dict.<locals>.load(module, local_state_dict, prefix)
2013 child_prefix = prefix + name + '.'
2014 child_state_dict = {k: v for k, v in local_state_dict.items() if k.startswith(child_prefix)}
-> 2015 load(child, child_state_dict, child_prefix)
2017 # Note that the hook can modify missing_keys and unexpected_keys.
2018 incompatible_keys = _IncompatibleKeys(missing_keys, unexpected_keys)
File ~/.virtualenvs/llama-demo/lib/python3.11/site-packages/torch/nn/modules/module.py:2015, in Module.load_state_dict.<locals>.load(module, local_state_dict, prefix)
2013 child_prefix = prefix + name + '.'
2014 child_state_dict = {k: v for k, v in local_state_dict.items() if k.startswith(child_prefix)}
-> 2015 load(child, child_state_dict, child_prefix)
2017 # Note that the hook can modify missing_keys and unexpected_keys.
2018 incompatible_keys = _IncompatibleKeys(missing_keys, unexpected_keys)
[... skipping similar frames: Module.load_state_dict.<locals>.load at line 2015 (4 times)]
File ~/.virtualenvs/llama-demo/lib/python3.11/site-packages/torch/nn/modules/module.py:2015, in Module.load_state_dict.<locals>.load(module, local_state_dict, prefix)
2013 child_prefix = prefix + name + '.'
2014 child_state_dict = {k: v for k, v in local_state_dict.items() if k.startswith(child_prefix)}
-> 2015 load(child, child_state_dict, child_prefix)
2017 # Note that the hook can modify missing_keys and unexpected_keys.
2018 incompatible_keys = _IncompatibleKeys(missing_keys, unexpected_keys)
File ~/.virtualenvs/llama-demo/lib/python3.11/site-packages/torch/nn/modules/module.py:2009, in Module.load_state_dict.<locals>.load(module, local_state_dict, prefix)
2007 def load(module, local_state_dict, prefix=''):
2008 local_metadata = {} if metadata is None else metadata.get(prefix[:-1], {})
-> 2009 module._load_from_state_dict(
2010 local_state_dict, prefix, local_metadata, True, missing_keys, unexpected_keys, error_msgs)
2011 for name, child in module._modules.items():
2012 if child is not None:
File ~/.virtualenvs/llama-demo/lib/python3.11/site-packages/bitsandbytes/nn/modules.py:298, in Linear8bitLt._load_from_state_dict(self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)
295 if input_name == "SCB":
296 if self.weight.SCB is None:
297 # buffers not yet initialized, can't call them directly without
--> 298 raise RuntimeError("Loading a quantized checkpoint into non-quantized Linear8bitLt is "
299 "not supported. Please call module.cuda() before module.load_state_dict()")
301 input_param = state_dict[key]
302 self.weight.SCB.copy_(input_param)
RuntimeError: Loading a quantized checkpoint into non-quantized Linear8bitLt is not supported. Please call module.cuda() before module.load_state_dict()
Maybe it's an error from bitsandbytes.
I'm sorry to bother you again, @younesbelkada @sgugger, but after the last hotfix the SequenceClassification model doesn't train as expected: it always logs nan in the metrics, and when I run inference the logits are nan too.
This is an example:
tensor([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]],
device='cuda:0', dtype=torch.float16)
This is the code that I'm using for the test, a copy/paste from a notebook:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

import torch
from peft import PeftConfig, PeftModel, prepare_model_for_int8_training, LoraConfig, get_peft_model
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
import bitsandbytes as bnb
from transformers import (
    AutoConfig,
    pipeline,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    DataCollatorWithPadding,
    LlamaTokenizer,
    LlamaForCausalLM,
    LlamaForSequenceClassification,
    AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup,
    set_seed
)
from pathlib import Path

model_path = "huggyllama/llama-7b"
tokenizer_path = "huggyllama/llama-7b"

tokenizer = LlamaTokenizer.from_pretrained(tokenizer_path)
tokenizer.pad_token = tokenizer.eos_token

from datasets import load_dataset

lang_datasets = load_dataset("papluca/language-identification")

def tokenize_function(examples):
    outputs = tokenizer(examples["text"], truncation=True, max_length=128)
    return outputs

tokenized_datasets = lang_datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"],
)

df = lang_datasets["train"].to_pandas()
classes = sorted(list(df["labels"].unique()))
id2label = {idx: classes[idx] for idx in range(len(classes))}
label2id = {v: k for k, v in id2label.items()}

tokenized_datasets = tokenized_datasets.map(lambda example: {"labels": label2id[example["labels"]]})

tok_train = tokenized_datasets['train']
tok_valid = tokenized_datasets['validation']
tok_test = tokenized_datasets['test']
print(f"Train / valid / test samples: {len(tok_train)} / {len(tok_valid)} / {len(tok_test)}")

model = LlamaForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path=model_path,
    id2label=id2label,
    label2id=label2id,
    load_in_8bit=True,
    device_map="auto",
    low_cpu_mem_usage=True,  # avoid offloading to CPU
    # ignore_mismatched_sizes=True,
    torch_dtype=torch.float16,
    # problem_type="multi_label_classification",
)
model.resize_token_embeddings(len(tokenizer))

model = prepare_model_for_int8_training(model=model, output_embedding_layer_name="score")

from peft import LoraConfig, get_peft_model, TaskType

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    inference_mode=False,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_CLS,
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

train_args = TrainingArguments(
    optim='adamw_torch',
    output_dir="./runs",
    num_train_epochs=1,
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=1,
    logging_steps=25,
    push_to_hub=False,
    # fp16=True,  # Not working now
    # load_best_model_at_end=True,  # Not working now
    # metric_for_best_model="accuracy",
    evaluation_strategy="epoch",
    # save_strategy="epoch",  # Not working
    dataloader_pin_memory=False,
    remove_unused_columns=False,
    label_names=["labels"],
)

def collate_fn(examples):
    return tokenizer.pad(examples, padding=True, return_tensors="pt")

from sklearn.metrics import accuracy_score, classification_report, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average="weighted")
    return {
        "accuracy": acc,
        "f1": f1
    }

trainer = Trainer(
    model=model,
    train_dataset=tok_train,
    eval_dataset=tok_valid,
    # compute_metrics=compute_metrics,  # Not working
    args=train_args,
    data_collator=collate_fn,
    tokenizer=tokenizer
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
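(Before running inference, the cache can be turned back on, as the comment above says:)

model.config.use_cache = True  # re-enable the KV cache once training is done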
And this is the inference:
def detect_lang(text: str) -> str:
    batch = tokenizer(text, return_tensors="pt")
    device = model.device  # note: `device` was defined elsewhere in the notebook; the model's first device works here
    with torch.cuda.amp.autocast():
        with torch.no_grad():
            logits = model(input_ids=batch.input_ids.to(device),
                           attention_mask=batch.attention_mask.to(device)).logits
    print(logits)
    predicted_class_id = logits.argmax().item()
    return model.config.id2label[predicted_class_id]
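Calling it on any short sentence already shows the problem, e.g.:

print(detect_lang("this is a short test sentence"))  # prints the all-nan logits shown above before returning a meaningless label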
I tested the same code with other models, e.g. xlm-roberta-base, following this example, and fp16 doesn't work there either. It's weird because I can train causal LM models but not classifiers.
You can't train a model in pure float16, you need mixed precision training. So you will need to load the model in float32 and probably use DeepSpeed for your fine-tuning.
Unless you use PEFT and LoRA, where only the low-rank adapters need to be in float32.
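For anyone following along, a rough sketch of that suggestion (the values are illustrative, not a tested recipe):

import torch
from transformers import LlamaForSequenceClassification, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType

# base weights stay in float32; only the LoRA adapters receive gradients
base = LlamaForSequenceClassification.from_pretrained(
    "huggyllama/llama-7b",
    num_labels=20,  # 20 languages in this dataset
    torch_dtype=torch.float32,
)
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type=TaskType.SEQ_CLS,
)
model = get_peft_model(base, peft_config)

# mixed precision is handled by the Trainer, not by casting the whole model to fp16
train_args = TrainingArguments(output_dir="./runs", fp16=True)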
I have the same issue as you @luisan06, how did you fix it in the end?
I couldn't fix it @robotsp, I spent some time trying to figure out what was wrong. I was only able to train the model using LoRA adapters in float32. But when I save the model and try to get the metrics again, loading the adapter from the local path, it seems that the adapter is not saved with the trained weights: it works right after training, but if you restart the environment and reload everything, the model performs really badly.
I've tried training by loading the model in 8-bit and using mixed precision with LoRA adapters, like I do with causal LM, but I can't do it with the classifier.
I don't really have experience with adapters, and all I've tried and seen was training causal LM models for chatbot-like tasks. I don't know if there is something special about classifiers.
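In case it helps with the reload problem, one way to check whether the saved adapter really contains the trained values (a sketch, the paths are placeholders):

import torch
from transformers import LlamaForSequenceClassification
from peft import PeftModel

# save only the LoRA adapter of the trained model
trainer.model.save_pretrained("./lora-adapter")

# reload it on top of a fresh base model
base = LlamaForSequenceClassification.from_pretrained(
    "huggyllama/llama-7b", num_labels=20, torch_dtype=torch.float32
)
reloaded = PeftModel.from_pretrained(base, "./lora-adapter")

# compare the LoRA tensors before and after the round trip
trained = {n: p.detach().cpu().float() for n, p in trainer.model.named_parameters() if "lora_" in n}
for name, param in reloaded.named_parameters():
    if "lora_" in name and name in trained and not torch.allclose(param.detach().cpu().float(), trained[name]):
        print("mismatch:", name)

If I remember correctly, PEFT also saves the score head via modules_to_save for SEQ_CLS tasks, so that tensor is worth comparing as well.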
Hello @luisan06 and @robotsp, currently INT8 with PEFT for sequence classification tasks produces nan losses, as noted above. For now, please use DeepSpeed + PEFT for this.
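For anyone landing here later, a minimal sketch of wiring DeepSpeed into the Trainer (the ZeRO settings below are illustrative, not a tuned config):

from transformers import TrainingArguments

# the Trainer accepts either a path to a DeepSpeed JSON file or a plain dict
ds_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

train_args = TrainingArguments(
    output_dir="./runs",
    fp16=True,
    evaluation_strategy="epoch",
    deepspeed=ds_config,
)
# launch through the DeepSpeed launcher (e.g. `deepspeed train.py`) so the distributed environment is set up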
Ok! Thanks for the response. I will try with DeepSpeed then and close the issue.
good good good!
@pacman100 @luisan06 I get a zero loss after a few iterations, and the final weights of the model are all NaN. Model: TinyLlama, fully fine-tuned on a completion task using FSDP.
CUDA: 11.8, PyTorch: 2.0.1+cu118, accelerate: 0.24.0.dev0, transformers: 4.35.0.dev0
https://github.com/OpenAccess-AI-Collective/axolotl/issues/1191
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
I'm doing some research using the LLaMA model as a causal LM without any problems, but when I try to load the model as SequenceClassification I can't load it using the accelerate tools.
Crashes with the following trace:
I read some issues about this error, but nothing works. After some debugging, I think the loading process fails while trying to load the score layer, because the method set_module_tensor_to_device receives the nn.Linear module without any weights. I'm not sure at all that this is the cause.
I can load & train the model as follows:
The problem is that I can train it on a single GPU with a large dataset, but when I try to use an instance with multiple GPUs, the trainer loads the whole model on each GPU and I get an OOM error.
Probably I'm missing something...
Expected behavior
Load and train the model without any issues.