Closed: sorryhyun closed this issue 1 year ago
Hi @comchobo, thanks for the issue! I think you need to load your model with `accelerate`:
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-1.3B", device_map="auto")
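For reference, here is a minimal sketch of how that loading step fits into a LoRA setup (the hyperparameters and target modules below are illustrative assumptions, not taken from the original script):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-1.3B")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-1.3B",
    device_map="auto",  # let accelerate place submodules across the visible GPUs
)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                                 # assumed rank, not from the original script
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,                    # assumed dropout
    target_modules=["q_proj", "v_proj"],  # attention projections in the M2M100/NLLB blocks
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only LoRA weights should be trainable
```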
Also make sure to use the `main` branch of `transformers`, as it has a fix for Trainer + multi-GPU: https://github.com/huggingface/transformers/pull/22532
To install the `main` branch of `transformers`:
pip install git+https://github.com/huggingface/transformers.git
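To verify the source build is the one actually in your environment, check the version string; a `main` install carries a `.dev0` suffix:

```python
import transformers

# A source install reports a dev version such as "4.28.0.dev0";
# a plain release number means pip reused an already-installed wheel.
print(transformers.__version__)
```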
I followed the instructions but I'm still getting an error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
I ran this code with the following command:
CUDA_VISIBLE_DEVICES=0,1,2 python lora_training.py
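One quick sanity check, sketched here on the assumption that `model` was loaded with `device_map="auto"` as above: accelerate records the placement of every submodule in `hf_device_map`.

```python
# Inspect how accelerate split the model across the visible GPUs;
# entries mapped to "cpu" or "disk" are candidates for device mismatches.
print(model.hf_device_map)
```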
@comchobo, can you share the full traceback of the error?
Traceback (most recent call last):
  File "/home/sorryhyun/finetune_nllb_with_transformer/lora_training.py", line 96, in <module>
    train_lora('lora_training_nllb_1p3B_lr=5e-4','lora_training_nllb_1p3B_lr=5e-4_saved',lr=5e-4)
  File "/home/sorryhyun/finetune_nllb_with_transformer/lora_training.py", line 92, in train_lora
    trainer.train()
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/transformers/trainer.py", line 2699, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/transformers/trainer.py", line 2731, in compute_loss
    outputs = model(**inputs)
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/peft/peft_model.py", line 667, in forward
    return self.base_model(
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 1335, in forward
    outputs = self.model(
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 1208, in forward
    encoder_outputs = self.encoder(
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 837, in forward
    layer_outputs = encoder_layer(
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 390, in forward
    hidden_states, attn_weights, _ = self.self_attn(
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 249, in forward
    query_states = self.q_proj(hidden_states) * self.scaling
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/peft/tuners/lora.py", line 350, in forward
    result += self.lora_B(self.lora_A(self.lora_dropout(x))) * self.scaling
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sorryhyun/anaconda3/envs/sorryhyun/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
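The last frames are telling: the failure happens inside the LoRA path (`self.lora_B(self.lora_A(...))`), which suggests the adapter weights were left on a different device than the dispatched base weights. A small diagnostic sketch (not from the original script) to confirm where the adapters live:

```python
# Print the device of every LoRA parameter; with this bug the lora_A / lora_B
# weights show up on cpu while the base model layers sit on cuda devices.
for name, param in model.named_parameters():
    if "lora_" in name:
        print(name, param.device)
```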
What is your `peft` version?
It is `peft==0.2.0`.
You need to install `peft` from source, as you need this fix: https://github.com/huggingface/peft/pull/145
Please uninstall `peft` and re-install it with the following command:
pip install git+https://github.com/huggingface/peft
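As with `transformers` above, you can confirm the source install took effect; a source build of `peft` should report a dev version rather than the 0.2.0 release:

```python
import peft

# Expect something like "0.3.0.dev0" (exact number may differ) after
# installing from the main branch; "0.2.0" means the old wheel is still active.
print(peft.__version__)
```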
Thanks! It's working now, @younesbelkada.
I tried to fine-tune the NLLB model on my custom dataset in a multi-GPU environment, and it raises the following error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument index in method wrapper_CUDA__index_select)
It worked fine on a single GPU. Is there anything in the code I need to modify for multi-GPU?
I attempted to run the following code: