amritgupta98 closed this issue 9 months ago
cc @younesbelkada
Hi @amritgupta98
Thanks for the issue! How do you run training: accelerate launch xxx.py or python xxx.py?
Hi @younesbelkada
I run it using python xxx.py
Thanks for confirming, @amritgupta98!
The reason you get this is that device_map="auto" dispatches the model evenly across all available GPUs. I am not sure what exactly is causing this issue, because PEFT should support multi-GPU by placing the LoRA weights on the correct device.
What is your PEFT version? Can you try with the latest PEFT? pip install -U peft
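For context, one way to see exactly how device_map="auto" has dispatched the model is to print the hf_device_map attribute that transformers records after loading. A minimal sketch, reusing the model from this thread (quantization and flash attention omitted for brevity):

```python
import torch
from transformers import AutoModelForCausalLM

# device_map="auto" spreads the layers over every visible GPU
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Maps each module to the device it was placed on,
# e.g. {"model.embed_tokens": 0, ..., "lm_head": 1} on a 2-GPU machine
print(model.hf_device_map)
```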
If your intent is to make full use of your GPU compute, you should use DDP instead. To do that:
1- Call accelerate config and select multi-GPU.
2- Change device_map="auto" to device_map={"": PartialState().process_index} (read more about it here: https://github.com/huggingface/accelerate/issues/1840#issuecomment-1683105994):
```python
from accelerate import PartialState

# One full copy of the model per DDP process, each placed on that process's GPU
device_index = PartialState().process_index
device_map = {"": device_index}
...
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,
    ...
)
```
3- Run your script with accelerate launch xxx.py to trigger DDP.
That way you will make full use of your GPUs by creating one copy of the training setup per GPU, and with two GPUs training should converge roughly 2x faster.
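Put together, steps 2 and 3 might look like the sketch below. The exact contents of the original bnb_config were not shared in this thread, so the 4-bit values here are assumptions; the model id is taken from the snippets later in the thread:

```python
import torch
from accelerate import PartialState
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Each DDP process gets its own full copy of the model on its own GPU
device_map = {"": PartialState().process_index}

# Example 4-bit config; the values in the reporter's bnb_config were not shared
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map=device_map,
    torch_dtype=torch.bfloat16,
)
```

Launched with accelerate launch train.py (after running accelerate config), each process then trains its own replica and DDP synchronizes the gradients across GPUs.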
Thank you for your help!! @younesbelkada
I am using the latest version of PEFT (0.7.1). I will try DDP. But I think the problem may be with the Mistral code.
I changed my model from Mistral 7B to LLaMA 2 7B and training worked absolutely fine without any other changes.
I only replaced this:

```python
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```
with this:

```python
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=token)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    token=token,
)
```
Transferred from transformers because it cannot be reproduced with the original code.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
I found the root cause of the issue. It occurs when we try to fine-tune the lm_head of Mistral on multiple GPUs. If I remove lm_head, or if I load the model onto a single GPU on a multi-GPU machine, it works fine. So why can't lm_head be trained on multiple GPUs?
It depends on the library you are using for training. Automatic device placement might not be done, and the hidden states might need to be moved to the correct device before the LM head.
@Abineshik this fix has worked for me. Thanks
@Abineshik @smreddy05: Can you please provide the snippet to run this on multi-GPU with your suggested fix?
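For reference, a minimal sketch of the workaround described above, i.e. keeping lm_head out of the trained modules while the model is sharded with device_map="auto". The LoRA hyperparameters and target module names here are assumptions; the original config was not shared in this thread:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Adapt only the attention projections; lm_head stays frozen, so no trainable
# parameters sit on the device boundary created by the sharded model.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # modules_to_save=["lm_head"],  # making lm_head trainable reportedly triggers the multi-GPU error
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Alternatively, as noted above, loading the whole model on a single GPU (e.g. device_map={"": 0}) or one copy per DDP process avoids the issue entirely.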
System Info
transformers version: 4.36.1

Who can help?
@ArthurZucker and @sgugger

Information
Tasks: examples folder (such as GLUE/SQuAD, ...)

Reproduction
I am trying to fine-tune a Mistral-7B-Instruct model on some data using a multi-GPU setup. The same code seems to work in a single-GPU setting (when I set CUDA_VISIBLE_DEVICES=0) but not with multiple GPUs:
Code snippet which produces the above error:
Expected behavior
The model should train in a multi-GPU setting without throwing any errors. The same script works in a single-GPU setting but throws the above error in a multi-GPU setting.
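The single-GPU setting mentioned above can be reproduced by restricting the visible devices before any CUDA initialization; a minimal sketch, illustrative only and not part of the original report:

```python
import os

# Hide all but the first GPU before torch initializes CUDA; with a single
# visible device, device_map="auto" keeps the whole model on that GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.device_count())  # prints 1 on a multi-GPU machine once the variable is set
```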