huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Title: CUDA RuntimeError: Unspecified Launch Failure during Training #30913

Open Hongjie1Chu opened 4 months ago

Hongjie1Chu commented 4 months ago

System Info

Who can help?

@ArthurZucker @younesbelkada @muellerzr

Why does this error occur when I pass a custom device_map? The map I wrote differs from the auto-generated map only in the order of devices, so why does it cause an error? Does the device order affect the execution results?
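
For comparison, here is a minimal sketch (not part of the original report; it assumes the accelerate API and that this checkpoint uses Llama-style decoder layers) for dumping the map that `device_map="auto"` would produce, without loading any weights:

```python
# Sketch: print the auto-generated device map for side-by-side comparison
# with the hand-written map below (assumes accelerate is installed).
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("FlagAlpha/Atom-7B-Chat")
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# "LlamaDecoderLayer" is an assumption based on this being a Llama-based model.
auto_map = infer_auto_device_map(empty_model, no_split_module_classes=["LlamaDecoderLayer"])
print(auto_map)
```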

Information

Tasks

Reproduction

import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, LlamaForCausalLM
from transformers import DataCollatorForLanguageModeling, DataCollatorWithPadding
from transformers.utils.fx import symbolic_trace
import argparse
import numpy as np
from datasets import load_metric, load_dataset

def compute_metrics(eval_preds):
    metric = load_metric("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpus', type=int, help='the number of gpus', default=8)
    parser.add_argument('--modelName', type=str, help="the name of model", default='Llama2')
    parser.add_argument('--bs', type=int, help="the name of bs", default=4)

args = parser.parse_args()

# Step 1: Define the model
tokenizer = AutoTokenizer.from_pretrained('FlagAlpha/Atom-7B-Chat')
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

device_map = {
    'model.embed_tokens': 6,
    'model.layers.0': 6,
    'model.layers.1': 4,
    'model.layers.2': 1,
    'model.layers.3': 1,
    'model.layers.4': 1,
    'model.layers.5': 0,
    'model.layers.6': 0,
    'model.layers.7': 0,
    'model.layers.8': 0,
    'model.layers.9': 0,
    'model.layers.10': 6,
    'model.layers.11': 5,
    'model.layers.12': 5,
    'model.layers.13': 5,
    'model.layers.14': 5,
    'model.layers.15': 5,
    'model.layers.16': 4,
    'model.layers.17': 4,
    'model.layers.18': 4,
    'model.layers.19': 4,
    'model.layers.20': 3,
    'model.layers.21': 3,
    'model.layers.22': 3,
    'model.layers.23': 3,
    'model.layers.24': 3,
    'model.layers.25': 2,
    'model.layers.26': 2,
    'model.layers.27': 2,
    'model.layers.28': 2,
    'model.layers.29': 2,
    'model.layers.30': 1,
    'model.layers.31': 1,
    "model.norm.weight": 1,
    "lm_head": 6,
}

model = AutoModelForCausalLM.from_pretrained('FlagAlpha/Atom-7B-Chat', device_map=device_map, num_labels=2)

print(model)
print(model.hf_device_map)

print("gpt start train")

# Step 4: Load the dataset
data_files = {
    'train': '/mnt/glue_mrpc/train.jsonl',
    'test': '/mnt/glue_mrpc/test.jsonl',
    'validation': '/mnt/glue_mrpc/validation.jsonl'
}
raw_datasets = load_dataset('json', data_files=data_files)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.rename_column("label", 'labels')
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Step 5: Train the model
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=5,
    per_device_train_batch_size=args.bs,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

print('start train')
trainer.train()

Expected behavior

I want to know if the device order in the device_map affects the results.

Hongjie1Chu commented 4 months ago

And when I set `device_map["model.embed_tokens"] = 0` and `device_map["model.norm.weight"] = 0`, it does not error at the start, but it errors later during training:

[screenshot of the error]

younesbelkada commented 4 months ago

Hi @Hongjie1Chu! In principle the device order shouldn't affect the training behaviour. Can you let us know what happens when you run the training script with `CUDA_LAUNCH_BLOCKING=1`? Also, do you run your training script with `accelerate launch xxx` or `python xxx.py`?
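
The variable can be prefixed to the launch command, or set from inside the script before the first CUDA call. A minimal sketch (an assumption about the setup, not part of the original script):

```python
# Sketch: enable synchronous kernel launches so the faulting kernel shows up
# at its real call site in the traceback (at a performance cost).
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the CUDA context is created

import torch  # import torch only after the variable is set

print(torch.cuda.is_available())
```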

Sharan1712 commented 4 months ago

I too am facing a similar issue. I haven't made any changes to my code, but all of a sudden it gives this error after training for about 30 steps.

Sharan1712 commented 3 months ago

Update: I downgraded PEFT to 0.10.0 and Transformers to 4.39.0, and it is working fine now.

Hongjie1Chu commented 3 months ago

Thanks for your answer!

Sharan1712 commented 3 months ago

Has there been a solution for this yet? I tried using the latest version of transformers and it still gave this issue. I want to use some of the new quantization methods.

Sharan1712 commented 3 months ago

@ArthurZucker @younesbelkada @muellerzr

younesbelkada commented 3 months ago

Hi! It is hard for us to debug without a proper error trace. Can you re-run the training script with `CUDA_LAUNCH_BLOCKING=1` and paste the error trace here?

tlangfor commented 2 months ago

I believe I'm seeing the same issue with peft 0.11.1 and transformers 4.41.2 (both installed from conda-forge).

When I rerun with CUDA_LAUNCH_BLOCKING=1 I get:

RuntimeError                              Traceback (most recent call last)
Cell In[16], line 20
      5 trainer = SFTTrainer(
      6     model=model,
      7     train_dataset=full_doc_dataset,
   (...)
     15     compute_metrics=lambda eval_pred: compute_metrics(eval_pred, tokenizer)  # Pass tokenizer here
     16 )
     18 model = accelerator.prepare(model)
---> 20 trainer.train()

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:440, in SFTTrainer.train(self, *args, **kwargs)
    437 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:
    438     self.model = self._trl_activate_neftune(self.model)
--> 440 output = super().train(*args, **kwargs)
    442 # After training we make sure to retrieve back the original forward pass method
    443 # for the embedding layer by removing the forward post hook.
    444 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/transformers/trainer.py:1885, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1883         hf_hub_utils.enable_progress_bars()
   1884 else:
-> 1885     return inner_training_loop(
   1886         args=args,
   1887         resume_from_checkpoint=resume_from_checkpoint,
   1888         trial=trial,
   1889         ignore_keys_for_eval=ignore_keys_for_eval,
   1890     )

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/transformers/trainer.py:2216, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2213     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
   2215 with self.accelerator.accumulate(model):
-> 2216     tr_loss_step = self.training_step(model, inputs)
   2218 if (
   2219     args.logging_nan_inf_filter
   2220     and not is_torch_xla_available()
   2221     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   2222 ):
   2223     # if loss is nan or inf simply add the average of previous logged losses
   2224     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/transformers/trainer.py:3241, in Trainer.training_step(***failed resolving arguments***)
   3238     loss = self.compute_loss(model, inputs)
   3240 del inputs
-> 3241 torch.cuda.empty_cache()
   3243 if self.args.n_gpu > 1:
   3244     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/torch/cuda/memory.py:162, in empty_cache()
    151 r"""Release all unoccupied cached memory currently held by the caching
    152 allocator so that those can be used in other GPU application and visible in
    153 `nvidia-smi`.
   (...)
    159     more details about GPU memory management.
    160 """
    161 if is_initialized():
--> 162     torch._C._cuda_emptyCache()

RuntimeError: CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
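
Since CUDA errors are reported asynchronously, the failure may only surface at a later synchronization point such as `torch.cuda.empty_cache()`, not at the kernel that actually failed. A minimal sketch (a hypothetical helper, not code from my notebook) that forces a sync right after the backward pass to localize the failing step:

```python
import torch

def training_step_with_sync(model, batch):
    """One training step with an explicit synchronize, so that a pending
    CUDA launch failure is raised here rather than at a later empty_cache()."""
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    torch.cuda.synchronize()  # any pending asynchronous CUDA error surfaces here
    return loss.detach()
```
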
amyeroberts commented 2 months ago

cc @BenjaminBossan Are you the best person to ping for PEFT now?

BenjaminBossan commented 2 months ago

Hmm, I don't see how this is PEFT related; there is no PEFT code being used here. Are you sure that the upgrade/downgrade of PEFT has any influence on the outcome and that it isn't caused by transformers?

amyeroberts commented 2 months ago

@BenjaminBossan Sorry, I was just skimming, saw peft mentioned and pinged you :)

Re SFTTrainer, perhaps @SunMarc is the best person here?

amyeroberts commented 4 weeks ago

Gentle ping @SunMarc