Hongjie1Chu opened this issue 4 months ago
And when I set device_map["model.embed_tokens"] = 0 and device_map["model.norm.weight"] = 0,
it no longer errors at startup, but it still errors later during training.
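For reference, a minimal sketch of the kind of manual placement described above, assuming a Llama-style module layout and two GPUs (the checkpoint name, layer count, and split point are illustrative assumptions, not taken from the original script):

import torch
from transformers import AutoModelForCausalLM

# Pin the embedding and the final norm to GPU 0, as in the comment above, and
# split the decoder layers between two devices. Module names follow the usual
# LlamaForCausalLM layout ("model.embed_tokens", "model.layers.N", "model.norm").
num_layers = 32  # e.g. a 7B Llama model; adjust to the actual checkpoint
device_map = {"model.embed_tokens": 0, "model.norm": 0, "lm_head": 0}
for i in range(num_layers):
    device_map[f"model.layers.{i}"] = 0 if i < num_layers // 2 else 1

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder checkpoint
    device_map=device_map,
    torch_dtype=torch.float16,
)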
Hi @Hongjie1Chu!
In principle the device order shouldn't affect the training behaviour. Can you let us know what happens when you run the training script with CUDA_LAUNCH_BLOCKING=1? Also, do you run your training script with accelerate launch xxx or with python xxx.py?
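One way to make sure the flag is picked up regardless of how the script is launched is to set it inside the training script itself, before anything touches the GPU; a small sketch:

import os

# Must be set before the first CUDA context is created (i.e. before any tensor
# or model is moved to a GPU), otherwise the setting has no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the environment variable is in place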
I too am facing a similar issue. I haven't made any changes to my code but all of a sudden, my code gives this error after training for like 30 steps.
Update: I downgraded my PEFT to 0.10.0 and Transformers to 4.39.0 and it is working fine now.
thanks for your answer!
Has there been a solution for this yet? I tried using the latest version of transformers and it still gave this issue. I want to use some of the new quantization methods.
@ArthurZucker @younesbelkada @muellerzr
Hi!
It is hard for us to debug without a proper error trace. Can you re-run the training script with CUDA_LAUNCH_BLOCKING=1 and paste the error trace here?
I believe I'm seeing the same issue with peft 0.11.1 and transformers 4.41.2 (both installed from conda-forge).
When I rerun with CUDA_LAUNCH_BLOCKING=1 I get:
RuntimeError Traceback (most recent call last)
Cell In[16], line 20
5 trainer = SFTTrainer(
6 model=model,
7 train_dataset=full_doc_dataset,
(...)
15 compute_metrics=lambda eval_pred: compute_metrics(eval_pred, tokenizer) # Pass tokenizer here
16 )
18 model = accelerator.prepare(model)
---> 20 trainer.train()
File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:440, in SFTTrainer.train(self, *args, **kwargs)
437 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:
438 self.model = self._trl_activate_neftune(self.model)
--> 440 output = super().train(*args, **kwargs)
442 # After training we make sure to retrieve back the original forward pass method
443 # for the embedding layer by removing the forward post hook.
444 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:
File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/transformers/trainer.py:1885, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1883 hf_hub_utils.enable_progress_bars()
1884 else:
-> 1885 return inner_training_loop(
1886 args=args,
1887 resume_from_checkpoint=resume_from_checkpoint,
1888 trial=trial,
1889 ignore_keys_for_eval=ignore_keys_for_eval,
1890 )
File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/transformers/trainer.py:2216, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
2213 self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
2215 with self.accelerator.accumulate(model):
-> 2216 tr_loss_step = self.training_step(model, inputs)
2218 if (
2219 args.logging_nan_inf_filter
2220 and not is_torch_xla_available()
2221 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
2222 ):
2223 # if loss is nan or inf simply add the average of previous logged losses
2224 tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)
File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/transformers/trainer.py:3241, in Trainer.training_step(***failed resolving arguments***)
3238 loss = self.compute_loss(model, inputs)
3240 del inputs
-> 3241 torch.cuda.empty_cache()
3243 if self.args.n_gpu > 1:
3244 loss = loss.mean() # mean() to average on multi-gpu parallel training
File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/torch/cuda/memory.py:162, in empty_cache()
151 r"""Release all unoccupied cached memory currently held by the caching
152 allocator so that those can be used in other GPU application and visible in
153 `nvidia-smi`.
(...)
159 more details about GPU memory management.
160 """
161 if is_initialized():
--> 162 torch._C._cuda_emptyCache()
RuntimeError: CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
cc @BenjaminBossan Are you the best person to ping for PEFT now?
Hmm, I don't see how this is PEFT related, there is no PEFT code being used? Are you sure that the upgrade/downgrade of PEFT has any influence on the outcome and that it's not because of transformers?
@BenjaminBossan Sorry, I was just skimming, saw peft mentioned and pinged you :)
Re SFTTrainer, perhaps @SunMarc is the best person here?
Gentle ping @SunMarc
System Info
transformers version: 4.41.0
Who can help?
@ArthurZucker @younesbelkada @muellerzr
Why does this error occur when passing a custom device_map? The map I wrote only differs from the auto-generated map in device order. Why does this cause an error? Does the device order affect the execution results?
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, LlamaForCausalLM
from transformers import DataCollatorForLanguageModeling, DataCollatorWithPadding
from transformers.utils.fx import symbolic_trace
import argparse
import numpy as np
from datasets import load_metric, load_dataset


def compute_metrics(eval_preds):
    metric = load_metric("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpus', type=int, help='the number of gpus', default=8)
    parser.add_argument('--modelName', type=str, help="the name of the model", default='Llama2')
    parser.add_argument('--bs', type=int, help="the batch size", default=4)
Expected behavior
I want to know if the device order in the device_map affects the results.
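A rough way to test exactly that is to record the automatically generated placement and then reload with a map that keeps the same module grouping but swaps the devices. The sketch below assumes two GPUs and a placeholder checkpoint; it is not taken from the original script:

from transformers import AutoModelForCausalLM

checkpoint = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint

# Load once with the automatic placement and record it; hf_device_map maps
# module names to device indices.
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")
auto_map = dict(model.hf_device_map)
print(auto_map)

# Build a map that only differs in device order: same modules, GPUs 0 and 1 swapped.
swapped_map = {name: {0: 1, 1: 0}.get(dev, dev) for name, dev in auto_map.items()}

del model
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map=swapped_map)

If training fails with swapped_map but works with the auto map, that would narrow the problem down to the device ordering rather than the module grouping.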