huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

Mistral: Expected all tensors to be on the same device, but found at least two devices, cuda:6 and cuda:7! (when checking argument for argument mat2 in method wrapper_CUDA_mm) #1277

Closed amritgupta98 closed 9 months ago

amritgupta98 commented 11 months ago

System Info

Who can help?

@ArthurZucker and @sgugger

Information

Tasks

Reproduction

I am trying to fine-tune a Mistral-7B-Instruct model on some data using a multi-GPU setup. The same code works in a single-GPU setting (when I set CUDA_VISIBLE_DEVICES=0) but fails with multiple GPUs:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[12], line 93
     80 trainer = SFTTrainer(
     81     model=model,
     82     train_dataset=dataset,
   (...)
     88     args=train_args,
     89 )
     92 # train
---> 93 trainer.train()
     95 # save model
     96 trainer.save_model()

File /opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:280, in SFTTrainer.train(self, *args, **kwargs)
    277 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:
    278     self.model = self._trl_activate_neftune(self.model)
--> 280 output = super().train(*args, **kwargs)
    282 # After training we make sure to retrieve back the original forward pass method
    283 # for the embedding layer by removing the forward post hook.
    284 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1537, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1535         hf_hub_utils.enable_progress_bars()
   1536 else:
-> 1537     return inner_training_loop(
   1538         args=args,
   1539         resume_from_checkpoint=resume_from_checkpoint,
   1540         trial=trial,
   1541         ignore_keys_for_eval=ignore_keys_for_eval,
   1542     )

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1854, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1851     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
   1853 with self.accelerator.accumulate(model):
-> 1854     tr_loss_step = self.training_step(model, inputs)
   1856 if (
   1857     args.logging_nan_inf_filter
   1858     and not is_torch_tpu_available()
   1859     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1860 ):
   1861     # if loss is nan or inf simply add the average of previous logged losses
   1862     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2728, in Trainer.training_step(self, model, inputs)
   2725     return loss_mb.reduce_mean().detach().to(self.args.device)
   2727 with self.compute_loss_context_manager():
-> 2728     loss = self.compute_loss(model, inputs)
   2730 if self.args.n_gpu > 1:
   2731     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2751, in Trainer.compute_loss(self, model, inputs, return_outputs)
   2749 else:
   2750     labels = None
-> 2751 outputs = model(**inputs)
   2752 # Save past state if it exists
   2753 # TODO: this needs to be fixed and made cleaner later.
   2754 if self.args.past_index >= 0:

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py:680, in convert_outputs_to_fp32.<locals>.forward(*args, **kwargs)
    679 def forward(*args, **kwargs):
--> 680     return model_forward(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py:668, in ConvertOutputsToFp32.__call__(self, *args, **kwargs)
    667 def __call__(self, *args, **kwargs):
--> 668     return convert_to_fp32(self.model_forward(*args, **kwargs))

File /opt/conda/lib/python3.10/site-packages/torch/amp/autocast_mode.py:14, in autocast_decorator.<locals>.decorate_autocast(*args, **kwargs)
     11 @functools.wraps(func)
     12 def decorate_autocast(*args, **kwargs):
     13     with autocast_instance:
---> 14         return func(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/peft/peft_model.py:1073, in PeftModelForCausalLM.forward(self, input_ids, attention_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict, task_ids, **kwargs)
   1062             raise AssertionError("forward in MPTForCausalLM does not support inputs_embeds")
   1063         return self.base_model(
   1064             input_ids=input_ids,
   1065             attention_mask=attention_mask,
   (...)
   1070             **kwargs,
   1071         )
-> 1073     return self.base_model(
   1074         input_ids=input_ids,
   1075         attention_mask=attention_mask,
   1076         inputs_embeds=inputs_embeds,
   1077         labels=labels,
   1078         output_attentions=output_attentions,
   1079         output_hidden_states=output_hidden_states,
   1080         return_dict=return_dict,
   1081         **kwargs,
   1082     )
   1084 batch_size = _get_batch_size(input_ids, inputs_embeds)
   1085 if attention_mask is not None:
   1086     # concat prompt attention mask

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/peft/tuners/tuners_utils.py:103, in BaseTuner.forward(self, *args, **kwargs)
    102 def forward(self, *args: Any, **kwargs: Any):
--> 103     return self.model.forward(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    163         output = module._old_forward(*args, **kwargs)
    164 else:
--> 165     output = module._old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File /opt/conda/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py:1057, in MistralForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1044 outputs = self.model(
   1045     input_ids=input_ids,
   1046     attention_mask=attention_mask,
   (...)
   1053     return_dict=return_dict,
   1054 )
   1056 hidden_states = outputs[0]
-> 1057 logits = self.lm_head(hidden_states)
   1058 logits = logits.float()
   1060 loss = None

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/peft/tuners/lora/layer.py:373, in Linear.forward(self, x, *args, **kwargs)
    371         scaling = self.scaling[active_adapter]
    372         x = x.to(lora_A.weight.dtype)
--> 373         result += lora_B(lora_A(dropout(x))) * scaling
    375 result = result.to(previous_dtype)
    376 return result

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/linear.py:114, in Linear.forward(self, input)
    113 def forward(self, input: Tensor) -> Tensor:
--> 114     return F.linear(input, self.weight, self.bias)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:6 and cuda:7! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

Code snippet which produces the above error:


import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

# `dataset` and `format_instruction` are defined earlier in the notebook.

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2",
                                             quantization_config=bnb_config,
                                             device_map='auto',
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2"
                                            )
model.config.pretraining_tp = 1

peft_config = LoraConfig(
        lora_alpha=64,
        lora_dropout=0.1,
        r=128,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=['q_proj','k_proj','v_proj','o_proj','gate_proj','down_proj','up_proj','lm_head']
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

train_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim='adamw_bnb_8bit',
    logging_steps=1,
    save_strategy='epoch',
    learning_rate=2e-5,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type='constant',
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=8192,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=format_instruction,
    args=train_args,
)

trainer.train()

trainer.save_model()

Expected behavior

The model should train in a multi-GPU setting without throwing any errors. The same script works in a single-GPU setting but throws the above error in a multi-GPU setting.

amyeroberts commented 11 months ago

cc @younesbelkada

younesbelkada commented 11 months ago

Hi @amritgupta98, thanks for the issue! How do you run training: accelerate launch xxx.py or python xxx.py?

amritgupta98 commented 11 months ago

Hi @younesbelkada, I run it using python xxx.py.

younesbelkada commented 11 months ago

Thanks for confirming, @amritgupta98!

The reason you get this is that device_map="auto" dispatches the model evenly across all available GPUs. I am not sure what exactly is causing this issue, because PEFT should support multi-GPU setups by placing the LoRA weights on the correct device.

What is your PEFT version? Can you try with the latest PEFT? pip install -U peft

If your intent is to make full use of your GPU compute, you should use DDP instead. To do that:

1- Call accelerate config and select multi-gpu.
2- Change device_map="auto" to device_map={"": PartialState().process_index} (read more about it here: https://github.com/huggingface/accelerate/issues/1840#issuecomment-1683105994):

from accelerate import PartialState

device_index = PartialState().process_index
device_map = {"": device_index}

...

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,
    ...
)

3- Run your script with accelerate launch xxx.py to trigger DDP

That way you will maximize your GPU compute by running one copy of the training setup per GPU, and training should converge roughly proportionally faster (~2x with two GPUs).
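
For reference, here is a minimal sketch of how the model-loading part of the original script could be adapted to this DDP setup. It is an illustration based on the steps above, not an official recipe; the dataset preparation, LoraConfig, TrainingArguments, and SFTTrainer parts are assumed to stay exactly as in the original snippet.

import torch
from accelerate import PartialState
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Each DDP process loads the full quantized model onto its own GPU,
# instead of sharding one model across GPUs with device_map="auto".
device_map = {"": PartialState().process_index}

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map=device_map,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

The rest of the script (prepare_model_for_kbit_training, get_peft_model, TrainingArguments, SFTTrainer) stays the same, and the whole thing is started with accelerate launch train.py, where train.py stands for whatever the script is called.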

amritgupta98 commented 11 months ago

Thank you for your help!! @younesbelkada

I am using the latest version of PEFT (0.7.1). I will try DDP. But I think the problem may be with the Mistral code.

I changed my model from Mistral 7B to LLaMA 2 7B and training worked absolutely fine without any other changes.

I only replaced -

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2",
                                             quantization_config=bnb_config,
                                             device_map='auto',
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2"
                                            )

with this -

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=token)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             quantization_config=bnb_config,
                                             device_map='auto',
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2",
                                             token=token
                                            )

ArthurZucker commented 11 months ago

Transferred from transformers because it cannot be reproduced with the original code.

github-actions[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Abineshik commented 8 months ago

I found the root cause of the issue. It occurs when we try to fine-tune the lm_head of Mistral on multiple GPUs. If I remove lm_head, or if I load the model on a single GPU of a multi-GPU machine, it works fine. So why can't lm_head be trained on multiple GPUs?
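
A minimal sketch of that workaround, assuming the rest of the original script stays unchanged, is simply to leave lm_head out of target_modules:

peft_config = LoraConfig(
    lora_alpha=64,
    lora_dropout=0.1,
    r=128,
    bias="none",
    task_type="CAUSAL_LM",
    # lm_head is deliberately omitted: per the discussion above, attaching a
    # LoRA adapter to the output projection is what triggers the
    # cross-device matmul when the model is sharded with device_map="auto".
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'down_proj', 'up_proj'],
)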

ArthurZucker commented 8 months ago

It depends on the library you are using for training. Automatic device placement might not be done, and the hidden states might need to be moved to the correct device before the LM head.
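
A hedged sketch of that idea (not an official transformers or PEFT API; it assumes PyTorch >= 2.0 for forward pre-hooks with with_kwargs=True, and that model is the causal LM loaded as above) is to register a forward pre-hook on the output embedding layer that moves its inputs onto the device holding its weights:

import torch

def _move_inputs_to_module_device(module, args, kwargs):
    # Move any tensor inputs onto the device where this module's weights live.
    device = next(module.parameters()).device
    args = tuple(a.to(device) if torch.is_tensor(a) else a for a in args)
    kwargs = {k: v.to(device) if torch.is_tensor(v) else v for k, v in kwargs.items()}
    return args, kwargs

# get_output_embeddings() returns lm_head for causal-LM models in transformers.
model.get_output_embeddings().register_forward_pre_hook(
    _move_inputs_to_module_device, with_kwargs=True
)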

smreddy05 commented 7 months ago

I found the root cause of the issue. It occurs when we try to fine-tune the lm_head of Mistral on multiple GPUs. If I remove lm_head, or if I load the model on a single GPU of a multi-GPU machine, it works fine. So why can't lm_head be trained on multiple GPUs?

@Abineshik this fix has worked for me. Thanks

amitbcp commented 6 months ago

@Abineshik @smreddy05: Can you please provide the snippet to run this on multi-GPU with your suggested fix?