
Multi-GPU inference affects LLM's (Llama2-7b-chat-hf) generation. #31582

Open · 3rdAT opened this issue 3 days ago

3rdAT commented 3 days ago

System Info

Who can help?

@ArthurZucker @gante

Information

Tasks

Reproduction

When I perform inference with two GPUs using the following command,

CUDA_VISIBLE_DEVICES=0,1 nohup python ./inference.py

the model generates answers properly.

However, when I use more than two GPUs with the following command,

CUDA_VISIBLE_DEVICES=0,1,3,4 nohup python ./inference.py

the model starts generating gibberish. Upon closer inspection, the logits output by the model are all NaN.
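For reference, the NaN logits can be confirmed with a single forward pass. This is a minimal diagnostic sketch, assuming `model` and `tokenizer` are loaded as in the loading sketch further below; the prompt is illustrative:

```python
import torch

# Run one forward pass and inspect the raw logits for NaN values.
# `model` and `tokenizer` are assumed to be already loaded (see below).
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits
print("any NaN in logits:", torch.isnan(logits).any().item())
```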

Note: I use device_map = "auto" while loading the model.
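For context, a minimal sketch of what inference.py might contain; the actual script is not shown in the issue, so the model id, dtype, and prompt here are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumption: the chat variant named in the title
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # assumption: half precision, common for Llama-2
    device_map="auto",          # shards the layers across all visible GPUs
)

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```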

Expected behavior

I expect the model to generate coherent answers regardless of how many GPUs it is sharded across, just as it does with two GPUs.