Multi-GPU inference affects LLM's (Llama2-7b-chat-hf) generation. #31582

Open 3rdAT opened 3 days ago

3rdAT commented 3 days ago

System Info

Who can help?

@ArthurZucker @gante




When I run inference on two GPUs with the following command,

CUDA_VISIBLE_DEVICES=0,1 nohup python ./

the model generates answers properly.

However, when I use more than two GPUs with the following command,

CUDA_VISIBLE_DEVICES=0,1,3,4 nohup python ./

the model starts generating gibberish. On closer inspection, the logits the model outputs are all NaN.
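For anyone trying to reproduce this, a small helper like the one below may make the NaN observation easier to report. It is only a sketch: `summarize_nonfinite` is a hypothetical name, and it assumes the logits for one position have already been converted to a flat Python list of floats (for example with `.float().tolist()` on a PyTorch tensor).

```python
import math


def summarize_nonfinite(values):
    """Count NaN and infinite entries in a flat list of floats."""
    nan_count = sum(1 for v in values if math.isnan(v))
    inf_count = sum(1 for v in values if math.isinf(v))
    return {
        "total": len(values),
        "nan": nan_count,
        "inf": inf_count,
        # True only when the list is non-empty and every entry is NaN.
        "all_nan": len(values) > 0 and nan_count == len(values),
    }


# Hypothetical usage, assuming `outputs` is the result of a forward pass:
#   last_token = outputs.logits[0, -1].float().tolist()
#   print(summarize_nonfinite(last_token))
print(summarize_nonfinite([float("nan"), float("nan")]))
```

Running the same check in both the two-GPU and the four-GPU setup would show whether the logits are entirely NaN or only partly non-finite.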

Note: I pass device_map="auto" when loading the model.
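Since the failing command lists GPUs 0,1,3,4 (skipping 2), it may help to spell out which listed GPU each in-process device index refers to. The sketch below only parses the environment variable; `visible_device_map` is a hypothetical helper, and it assumes plain integer IDs, which CUDA renumbers in the order they are listed.

```python
import os


def visible_device_map(env_value=None):
    """Map in-process device names (cuda:0, cuda:1, ...) to the IDs
    listed in CUDA_VISIBLE_DEVICES, in the order they appear."""
    if env_value is None:
        env_value = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    entries = [e.strip() for e in env_value.split(",") if e.strip()]
    # An empty result means the variable is unset or empty,
    # so no remapping information is available from it.
    return {f"cuda:{i}": e for i, e in enumerate(entries)}


print(visible_device_map("0,1,3,4"))
```

With the four-GPU command, the process sees four devices numbered 0 to 3, so `cuda:2` and `cuda:3` inside the script correspond to the entries `3` and `4` in the list. This can be useful when checking whether the NaNs appear only once layers land on particular cards.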

Expected behavior

I expect the model to generate coherent text regardless of how many GPUs are used.