huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Error loading/training LlamaForSequenceClassification with device_map="auto", load_in_8bit=True and fp16=True #1333

Closed: luisan06 closed this issue 1 year ago

luisan06 commented 1 year ago

System Info

- `Accelerate` version: 0.18.0.dev0
- Platform: Linux-5.15.0-1033-aws-x86_64-with-glibc2.31
- Python version: 3.11.3
- Numpy version: 1.24.2
- PyTorch version (GPU?): 2.0.0+cu117 (True)
- `Accelerate` default config:
    Not found

Reproduction

I'm doing some research using the LLaMA model as a causal LM without any problems, but when I try to load the model for sequence classification, I can't load it using the Accelerate tools.

import torch
from peft import PeftConfig, PeftModel
import bitsandbytes as bnb
from accelerate import init_empty_weights, load_checkpoint_and_dispatch, load_checkpoint_in_model, infer_auto_device_map
from transformers import (
    AutoConfig,
    pipeline,
    TrainingArguments,
    Trainer,    
    DataCollatorWithPadding,
    LlamaTokenizer,
    LlamaForSequenceClassification,
    LlamaConfig,
    AutoModelForSequenceClassification,
    AutoModelForCausalLM
)
from pathlib import Path

# model_path is a local path to converted LLaMA weights (also tested with
# "decapoda-research/llama-7b-hf" and later "huggyllama/llama-7b");
# id2label / label2id are the label mappings for the classification task.
model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path=model_path,
    id2label=id2label,
    label2id=label2id,
    load_in_8bit=True,
    device_map="auto",
    torch_dtype=torch.float16,
)

This crashes with the following trace:

Traceback (most recent call last):
  File "/home/ubuntu/LLaMA/training_test.py", line 30, in <module>
    model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path=model_path,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/LLaMA/transformers/src/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/LLaMA/transformers/src/transformers/modeling_utils.py", line 2846, in from_pretrained
    dispatch_model(model, device_map=device_map, offload_dir=offload_folder, offload_index=offload_index, offload_buffers=True)
  File "/home/ubuntu/LLaMA/accelerate/src/accelerate/big_modeling.py", line 370, in dispatch_model
    attach_align_device_hook_on_blocks(
  File "/home/ubuntu/LLaMA/accelerate/src/accelerate/hooks.py", line 478, in attach_align_device_hook_on_blocks
    add_hook_to_module(module, hook)
  File "/home/ubuntu/LLaMA/accelerate/src/accelerate/hooks.py", line 155, in add_hook_to_module
    module = hook.init_hook(module)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/LLaMA/accelerate/src/accelerate/hooks.py", line 251, in init_hook
    set_module_tensor_to_device(module, name, self.execution_device)
  File "/home/ubuntu/LLaMA/accelerate/src/accelerate/utils/modeling.py", line 136, in set_module_tensor_to_device
    raise ValueError(f"{tensor_name} is on the meta device, we need a `value` to put in on {device}.")
ValueError: weight is on the meta device, we need a `value` to put in on 0.

I read some issues about this error, but nothing works. After some debugging, I think the loading process fails when trying to load the score layer, because set_module_tensor_to_device receives the nn.Linear module without any weights. I'm not sure at all if that is the error.
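
A minimal sketch of what I think is happening, using Accelerate's public helpers (the layer sizes are made up):

import torch
from torch import nn
from accelerate import init_empty_weights
from accelerate.utils import set_module_tensor_to_device

with init_empty_weights():
    score = nn.Linear(4096, 20, bias=False)  # stand-in for the `score` head

# With a concrete `value`, the meta tensor is materialized on the target device:
set_module_tensor_to_device(score, "weight", "cpu", value=torch.randn(20, 4096))

# Without a `value` there is nothing to materialize, so this raises the
# "weight is on the meta device, we need a `value`" ValueError:
# set_module_tensor_to_device(score, "weight", "cpu")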

I can load & train the model as follows:

model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path=model_path,
    id2label=id2label,
    label2id=label2id,
    torch_dtype=torch.float16,
).to("cuda")

The problem is that with a large dataset I can only train it on a single GPU; when I try to use a multi-GPU instance, the trainer loads the whole model on each GPU and I get an OOM error.

Probably I'm missing something...

Expected behavior

Load and train the model without any issues.

sgugger commented 1 year ago

We can't fix something we can't reproduce. What is the model_path?

luisan06 commented 1 year ago

Yeah, sorry, it's a local path to the converted LLaMA weights. I also tested with decapoda-research/llama-7b-hf.

I forgot to mention that I had to modify the script "convert_llama_weights_to_hf.py" with the following line:

print("Saving in the Transformers format.")
model.save_pretrained(model_path, max_shard_size="425MB")

The system crashes when loading huge shards otherwise.

sgugger commented 1 year ago

Can you try huggyllama/llama-7b? The decapoda checkpoints are not compatible with Transformers. Also make sure to use the latest release or the main version.

luisan06 commented 1 year ago

I just tried it and got the same error using huggyllama/llama-7b. I'm using the transformers and accelerate libraries from source, with everything at the latest version.

Can you load the model with the LlamaForSequenceClassification class?

sgugger commented 1 year ago

Oh, I didn't realize you were using this class, sorry. I was trying with LlamaForCausalLM. There is indeed a bug there; I will fix it ASAP.

luisan06 commented 1 year ago

Oh OK, I will wait then! Thanks for the help :+1:

sgugger commented 1 year ago

On my side, it's fixed with the PR above.

luisan06 commented 1 year ago

Yeah, now everything seems to work. I just started a training run with device_map="auto" and the model loads correctly on multiple GPUs. :tada:

What was the problem? I don't know much about meta devices.
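
For reference, tensors on the "meta" device carry only shape and dtype metadata, no actual storage; that is what lets Accelerate build a model skeleton without allocating memory, but every weight must later be materialized with a real value on its execution device. A tiny illustration:

import torch

t = torch.empty(2, 3, device="meta")  # metadata only, no data allocated
print(t.shape, t.dtype)               # torch.Size([2, 3]) torch.float32
# t.cpu()  # raises NotImplementedError: Cannot copy out of meta tensor; no data!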

luisan06 commented 1 year ago

After a minimal training test, I now get the following error:

RuntimeError: Loading a quantized checkpoint into non-quantized Linear8bitLt is not supported. Please call module.cuda() before module.load_state_dict()

It's a huge trace to copy. I'm going to check whether I missed any casting.

sgugger commented 1 year ago

cc @younesbelkada

luisan06 commented 1 year ago

This is the trace. It only fails when I configure the trainer with an eval and save strategy and set load_best_model_at_end=True.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[30], line 1
----> 1 trainer.train()

File ~/LLaMA/transformers/src/transformers/trainer.py:1661, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1656     self.model_wrapped = self.model
   1658 inner_training_loop = find_executable_batch_size(
   1659     self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1660 )
-> 1661 return inner_training_loop(
   1662     args=args,
   1663     resume_from_checkpoint=resume_from_checkpoint,
   1664     trial=trial,
   1665     ignore_keys_for_eval=ignore_keys_for_eval,
   1666 )

File ~/LLaMA/transformers/src/transformers/trainer.py:2048, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2045     elif is_sagemaker_mp_enabled():
   2046         smp.barrier()
-> 2048     self._load_best_model()
   2050 # add remaining tr_loss
   2051 self._total_loss_scalar += tr_loss.item()

File ~/LLaMA/transformers/src/transformers/trainer.py:2224, in Trainer._load_best_model(self)
   2219         state_dict = torch.load(best_model_path, map_location="cpu")
   2221     # If the model is on the GPU, it still works!
   2222     # workaround for FSDP bug https://github.com/pytorch/pytorch/issues/82963
   2223     # which takes *args instead of **kwargs
-> 2224     load_result = model.load_state_dict(state_dict, False)
   2225 if not is_sagemaker_mp_enabled():
   2226     self._issue_warnings_after_load(load_result)

File ~/.virtualenvs/llama-demo/lib/python3.11/site-packages/torch/nn/modules/module.py:2027, in Module.load_state_dict(self, state_dict, strict)
   2020         out = hook(module, incompatible_keys)
   2021         assert out is None, (
   2022             "Hooks registered with ``register_load_state_dict_post_hook`` are not"
   2023             "expected to return new values, if incompatible_keys need to be modified,"
   2024             "it should be done inplace."
   2025         )
-> 2027 load(self, state_dict)
   2028 del load
   2030 if strict:

File ~/.virtualenvs/llama-demo/lib/python3.11/site-packages/torch/nn/modules/module.py:2015, in Module.load_state_dict.<locals>.load(module, local_state_dict, prefix)
   2013         child_prefix = prefix + name + '.'
   2014         child_state_dict = {k: v for k, v in local_state_dict.items() if k.startswith(child_prefix)}
-> 2015         load(child, child_state_dict, child_prefix)
   2017 # Note that the hook can modify missing_keys and unexpected_keys.
   2018 incompatible_keys = _IncompatibleKeys(missing_keys, unexpected_keys)

File ~/.virtualenvs/llama-demo/lib/python3.11/site-packages/torch/nn/modules/module.py:2015, in Module.load_state_dict.<locals>.load(module, local_state_dict, prefix)
   2013         child_prefix = prefix + name + '.'
   2014         child_state_dict = {k: v for k, v in local_state_dict.items() if k.startswith(child_prefix)}
-> 2015         load(child, child_state_dict, child_prefix)
   2017 # Note that the hook can modify missing_keys and unexpected_keys.
   2018 incompatible_keys = _IncompatibleKeys(missing_keys, unexpected_keys)

    [... skipping similar frames: Module.load_state_dict.<locals>.load at line 2015 (4 times)]

File ~/.virtualenvs/llama-demo/lib/python3.11/site-packages/torch/nn/modules/module.py:2015, in Module.load_state_dict.<locals>.load(module, local_state_dict, prefix)
   2013         child_prefix = prefix + name + '.'
   2014         child_state_dict = {k: v for k, v in local_state_dict.items() if k.startswith(child_prefix)}
-> 2015         load(child, child_state_dict, child_prefix)
   2017 # Note that the hook can modify missing_keys and unexpected_keys.
   2018 incompatible_keys = _IncompatibleKeys(missing_keys, unexpected_keys)

File ~/.virtualenvs/llama-demo/lib/python3.11/site-packages/torch/nn/modules/module.py:2009, in Module.load_state_dict.<locals>.load(module, local_state_dict, prefix)
   2007 def load(module, local_state_dict, prefix=''):
   2008     local_metadata = {} if metadata is None else metadata.get(prefix[:-1], {})
-> 2009     module._load_from_state_dict(
   2010         local_state_dict, prefix, local_metadata, True, missing_keys, unexpected_keys, error_msgs)
   2011     for name, child in module._modules.items():
   2012         if child is not None:

File ~/.virtualenvs/llama-demo/lib/python3.11/site-packages/bitsandbytes/nn/modules.py:298, in Linear8bitLt._load_from_state_dict(self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)
    295 if input_name == "SCB":
    296     if self.weight.SCB is None:
    297         # buffers not yet initialized, can't call them directly without
--> 298         raise RuntimeError("Loading a quantized checkpoint into non-quantized Linear8bitLt is "
    299                            "not supported. Please call module.cuda() before module.load_state_dict()")
    301     input_param = state_dict[key]
    302     self.weight.SCB.copy_(input_param)

RuntimeError: Loading a quantized checkpoint into non-quantized Linear8bitLt is not supported. Please call module.cuda() before module.load_state_dict()

Maybe it's an error in bitsandbytes.
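
The constraint the error describes can be reproduced in isolation; a minimal sketch, assuming bitsandbytes' Linear8bitLt behavior at the time (CUDA required):

import torch
import bitsandbytes as bnb

quantized = bnb.nn.Linear8bitLt(16, 16, has_fp16_weights=False).cuda()
state = quantized.state_dict()  # includes the "SCB" quantization scale buffer

fresh = bnb.nn.Linear8bitLt(16, 16, has_fp16_weights=False)
# fresh.load_state_dict(state)  # raises the RuntimeError above: SCB is still None
fresh.cuda()                    # .cuda() quantizes the weights and allocates SCB
fresh.load_state_dict(state)    # now succeeds, as the error message suggests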

luisan06 commented 1 year ago

I'm sorry to bother you again, @younesbelkada @sgugger, but after the last hotfix the SequenceClassification model doesn't train as expected: it always logs nan in the metrics, and when I run inference the logits are nan too.

This is an example:

tensor([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]],
       device='cuda:0', dtype=torch.float16)

This is the code I'm using for the test, copied from a notebook:

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0,1,2,3"

import torch
from peft import PeftConfig, PeftModel, prepare_model_for_int8_training, LoraConfig, get_peft_model
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
import bitsandbytes as bnb
from transformers import (
    AutoConfig,
    pipeline,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    DataCollatorWithPadding,
    LlamaTokenizer,
    LlamaForCausalLM,
    LlamaForSequenceClassification,
    AutoModelForSequenceClassification,
    pipeline,
    get_linear_schedule_with_warmup, 
    set_seed
)

from pathlib import Path
model_path = "huggyllama/llama-7b"
tokenizer_path = "huggyllama/llama-7b"

tokenizer = LlamaTokenizer.from_pretrained(tokenizer_path)
tokenizer.pad_token = tokenizer.eos_token

from datasets import load_dataset
lang_datasets = load_dataset("papluca/language-identification")

def tokenize_function(examples):
    outputs = tokenizer(examples["text"], truncation=True, max_length=128)
    return outputs

tokenized_datasets = lang_datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"],
)

df = lang_datasets["train"].to_pandas()
classes = sorted(list(df["labels"].unique()))
id2label = {idx: classes[idx] for idx in range(len(classes))}
label2id = {v: k for k, v in id2label.items()}

tokenized_datasets = tokenized_datasets.map(lambda example: {"labels": label2id[example["labels"]]})

tok_train = tokenized_datasets['train']
tok_valid = tokenized_datasets['validation']
tok_test = tokenized_datasets['test']
print(f"Train / valid / test samples: {len(tok_train)} / {len(tok_valid)} / {len(tok_test)}")

model = LlamaForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path=model_path,
    id2label=id2label,
    label2id=label2id,
    load_in_8bit=True,
    device_map="auto",
    low_cpu_mem_usage=True,  # avoid offloading to CPU
    # ignore_mismatched_sizes=True,
    torch_dtype=torch.float16,
    # problem_type="multi_label_classification",
)
model.resize_token_embeddings(len(tokenizer))

model = prepare_model_for_int8_training(model=model, output_embedding_layer_name="score")

from peft import LoraConfig, get_peft_model, TaskType 

peft_config = LoraConfig(
    r=16,    
    lora_alpha=32,
    inference_mode=False,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_CLS,
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

train_args = TrainingArguments(
    optim='adamw_torch',
    output_dir="./runs",
    num_train_epochs=1,
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=1,
    logging_steps=25,    
    push_to_hub=False,
    # fp16=True, # Not working now
    # load_best_model_at_end=True, # Not working now
    # metric_for_best_model="accuracy",
    evaluation_strategy="epoch",
    # save_strategy="epoch", # Not working
    dataloader_pin_memory=False,
    remove_unused_columns=False,
    label_names=["labels"],
)

def collate_fn(examples): 
    return tokenizer.pad(examples, padding=True, return_tensors="pt")

from sklearn.metrics import accuracy_score, classification_report, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average="weighted")
    return {
        "accuracy": acc,
        "f1": f1
        }

trainer = Trainer(
    model=model,
    train_dataset=tok_train,
    eval_dataset=tok_valid,
    # compute_metrics=compute_metrics, # Not working
    args=train_args,
    data_collator=collate_fn,
    tokenizer=tokenizer
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

And this is the inference code:

device = model.device  # note: `device` was not defined in the original snippet

def detect_lang(text: str) -> str:
    batch = tokenizer(text, return_tensors="pt")

    with torch.cuda.amp.autocast():
        with torch.no_grad():
            logits = model(input_ids=batch.input_ids.to(device),
                           attention_mask=batch.attention_mask.to(device)).logits
            print(logits)

    predicted_class_id = logits.argmax().item()
    return model.config.id2label[predicted_class_id]
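
A quick sanity check; the expected output assumes the id2label mapping built from the dataset's language codes:

print(detect_lang("This is an English sentence."))  # should print a code such as "en"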

luisan06 commented 1 year ago

I tested the same code with other models (xlm-roberta-base), following this example, and fp16 doesn't work either. It's weird because I can train causal LM models but not classifiers.

sgugger commented 1 year ago

You can't train a model in pure float16; you need mixed-precision training. So you will need to load the model in float32 and probably use DeepSpeed for your fine-tuning.

Unless you use PEFT and LoRA, where only the low-rank adapters need to be in float32.
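
A minimal sketch of that setup, reusing model_path / id2label / label2id from above: the weights stay in float32 and the Trainer's fp16 flag handles the mixed-precision math.

model = LlamaForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path=model_path,
    id2label=id2label,
    label2id=label2id,
    # no torch_dtype=torch.float16 here: master weights remain float32
)

train_args = TrainingArguments(
    output_dir="./runs",
    fp16=True,  # mixed precision: fp16 forward/backward, float32 weight updates
)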

robotsp commented 1 year ago

I have the same issue as you, @luisan06. How did you fix it?

luisan06 commented 1 year ago

I couldn't fix it, @robotsp. I spent some time trying to figure out what was wrong. I was only able to train the model using LoRA adapters in float32. But when I save the model and try to compute the metrics again, loading the adapter from the local path, it seems the adapter is not saved with the trained weights: it works right after training, but if you restart the environment and reload everything, the model performs really badly.
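
For reference, this is the save/reload round trip being attempted, with illustrative paths (save_pretrained on a PEFT model stores only the adapter weights, which must be re-attached to a freshly loaded base model):

model.save_pretrained("./lora-seq-cls")  # adapter weights only

from peft import PeftModel
base = LlamaForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path=model_path,
    id2label=id2label,
    label2id=label2id,
)
reloaded = PeftModel.from_pretrained(base, "./lora-seq-cls")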

I've tried loading the model in 8-bit and training with mixed precision using LoRA adapters, as I do with causal LMs, but I can't make it work with the classifier.

I don't really have experience with adapters, and everything I've tried and seen was training causal LM models for chatbot-like tasks. I don't know if there is something special about classifiers.

pacman100 commented 1 year ago

Hello @luisan06 and @robotsp, currently INT8 with PEFT for sequence classification tasks produces nan losses, as noted above. For now, please use DeepSpeed + PEFT for this.
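
A hedged sketch of that route (the ZeRO config below is a minimal illustrative example, not a tuned one):

ds_config = {
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

train_args = TrainingArguments(
    output_dir="./runs",
    fp16=True,
    deepspeed=ds_config,  # accepts a dict or a path to a DeepSpeed JSON config
)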

luisan06 commented 1 year ago

OK! Thanks for the response. I'll try DeepSpeed then and close the issue.

robotsp commented 1 year ago

good good good!

hahmad2008 commented 9 months ago

@pacman100 @luisan06 I get a zero loss after a few iterations, and the final weights of the model are all NaN. Model: TinyLlama, fully fine-tuned on a completion task using FSDP.

- cuda: 11.8
- pytorch: 2.0.1+cu118
- accelerate: 0.24.0.dev0
- transformers: 4.35.0.dev0

https://github.com/OpenAccess-AI-Collective/axolotl/issues/1191