
Multi-GPU inference affects LLM's (Llama2-7b-chat-hf) generation. #31582

Open · 3rdAT opened this issue 3 days ago

3rdAT commented 3 days ago

System Info

Who can help?

@ArthurZucker @gante

Information

Tasks

Reproduction

When I perform inference with two GPUs using the following command,

CUDA_VISIBLE_DEVICES=0,1 nohup python ./inference.py

the model generates answers properly.

However, when I use more than two GPUs with the following command,

CUDA_VISIBLE_DEVICES=0,1,3,4 nohup python ./inference.py

the model starts generating gibberish. Upon closer inspection, the logits output by the model are all NaN.
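For reference, the NaN logits can be confirmed with a single forward pass. This is a minimal diagnostic sketch, assuming `model` and `tokenizer` are loaded as in the loading sketch further below; the prompt is illustrative:

```python
import torch

# Run one forward pass and inspect the raw logits for NaN values.
# `model` and `tokenizer` are assumed to be already loaded (see below).
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits
print("any NaN in logits:", torch.isnan(logits).any().item())
```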

Note: I use device_map = "auto" while loading the model.
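For context, a minimal sketch of what inference.py might contain; the actual script is not shown in the issue, so the model id, dtype, and prompt here are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumption: the chat variant named in the title
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # assumption: half precision, common for Llama-2
    device_map="auto",          # shards the layers across all visible GPUs
)

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```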

Expected behavior

I expect the model to generate coherent answers regardless of how many GPUs it is sharded across, just as it does with two GPUs.