Open vasili111 opened 1 year ago
It's a CUDA device issue!
Use this code to set the CUDA device:
import torch
torch.cuda.set_device(0)
Also check that your CUDA and cuDNN versions are compatible.
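A minimal sketch (my addition, assuming a standard PyTorch install) for checking which CUDA and cuDNN versions the installed PyTorch build was compiled against:

# Print the CUDA / cuDNN versions baked into the PyTorch build.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version (build):", torch.version.cuda)               # e.g. 11.8
print("cuDNN version (build):", torch.backends.cudnn.version())  # e.g. 8902 for cuDNN 8.9.2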
@vasili111 A couple of notes here. I am not able to repro your 13B-chat issue. For the 70B model, torchrun --nproc_per_node 8
launches 8 processes, which requires 8 GPUs, and you only have two available; change it to 2, and with fp16 you might be able to pull it off.
@raghu-007 and @HamidShojanazeri
Thank you for your replies and help.
I checked CUDA and Pytorch installation and Pytorch sees two GPUs and is able to run this code:
import torch
x = torch.rand(5, 3)
print(x)
I am using CUDA 11.8.0 and cuDNN 8.9.2.26. These are the versions recommended as compatible by the admin of the HPC cluster I am using. Please let me know if this could be an issue.
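For reference, a minimal sketch (my addition, not from the thread) to confirm that both GPUs are actually visible to PyTorch:

# List the GPUs PyTorch can see, with their names and memory sizes.
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")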
It seems the problem with the 70B model is, as @HamidShojanazeri suggested, the number of available GPUs (two) versus torchrun --nproc_per_node 8
asking for 8 GPUs.
But I am also not able to run the 13B model, where I have 2 GPUs and am asking for 2 GPUs with --nproc_per_node 2
like this:
torchrun --nproc_per_node 2 example_chat_completion.py \
--ckpt_dir llama-2-13b-chat/ \
--tokenizer_path tokenizer.model \
--max_seq_len 512 --max_batch_size 6
The output I am getting:
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
> initializing model parallel with size 2
> initializing ddp with size 1
> initializing pipeline with size 1
Loaded in 14.85 seconds
and nothing else happens. So here I have 2 GPUs and am asking to use 2 GPUs, but it still does not work. What could the problem be?
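As a hedged aside (my addition, not from the thread): when a torchrun job loads the model and then hangs silently, one way to check whether the problem is in inter-GPU communication rather than in the Llama code is a minimal NCCL all_reduce test across the two GPUs (setting NCCL_DEBUG=INFO can also surface communication errors):

# test_dist.py -- run with: torchrun --nproc_per_node 2 test_dist.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # torchrun supplies rank/world-size env vars
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Each rank contributes rank + 1; after all_reduce both ranks should print 3.0.
t = torch.tensor([float(rank + 1)], device="cuda")
dist.all_reduce(t)
print(f"rank {rank}: all_reduce result = {t.item()}", flush=True)
dist.destroy_process_group()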
@HamidShojanazeri
with fp16 you might be able to pull it off
Could you please clarify what exactly fp16 means here? Sorry, I am new to DL/LLM/HPC.
For 13b you need 3 gpus and for the 70b, 8 gpus.
@EmanuelaBoros
For 13B the documentation says that it should be --nproc_per_node 2. This means 2 GPUs, right?
@vasili111 Maybe this is true for the one provided by Meta with the download script. I wish I could tell you more, but if you look here, the model is split in 3. At least, this is how it works on my side.
My mistake for the 70B model - it might be 15 GPUs.
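For what it's worth (my addition, hedged): with the Meta-format download, the reference code asserts that the number of processes matches the number of consolidated.*.pth shards in the checkpoint directory, so the shard count is what --nproc_per_node has to equal. A quick way to check:

# Count the model-parallel shards in a Meta-format Llama 2 checkpoint directory.
from pathlib import Path

ckpt_dir = Path("llama-2-13b-chat")  # adjust to where the checkpoint was downloaded
shards = sorted(ckpt_dir.glob("consolidated.*.pth"))
print(f"{len(shards)} shard(s):", [s.name for s in shards])
print(f"--nproc_per_node should be {len(shards)}")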
@vasili111 Having the same problem here. The 13b-chat model worked fine when I tested it on a server with 8 80 GB A100 GPUs, while utilizing only 2 GPUs. But when I downloaded the model to a server with 2 80 GB A100 GPUs, the 13b-chat model suddenly stopped working. The 7b-chat model is still working fine.
As I mentioned, in this particular example 13B needs only 2 GPUs.
I am not able to repro this @vasili111, but it seems the model got loaded. Can you add more logging in the generate function to get more info?
With FP16, I meant half precision; in fact, the model is already in half precision.
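A hedged sketch of what such extra logging could look like (my addition; it assumes the upstream example_chat_completion.py structure with Llama.build and generator.chat_completion, and the prompt text and timing prints are mine, so adjust to your local copy):

# Timing/logging wrapper around the load and generate steps.
import time
from llama import Llama

def main(ckpt_dir, tokenizer_path, max_seq_len=512, max_batch_size=6):
    t0 = time.time()
    generator = Llama.build(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
    )
    print(f"Llama.build finished in {time.time() - t0:.1f}s", flush=True)

    dialogs = [[{"role": "user", "content": "Say hello in one sentence."}]]
    t1 = time.time()
    print("starting chat_completion ...", flush=True)
    results = generator.chat_completion(dialogs, max_gen_len=64, temperature=0.6, top_p=0.9)
    print(f"chat_completion finished in {time.time() - t1:.1f}s", flush=True)
    for r in results:
        print(r["generation"]["content"])

if __name__ == "__main__":
    import fire
    fire.Fire(main)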
@HamidShojanazeri is the number of GPUs the exact requirement, or is it the amount of memory?
For example, one 80 GB GPU would fit the 13B model, I assume, but it would not respect the 2x GPU requirement.
Would having only 1 GPU be a limiting case here? If so, is there a way around it?
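A rough back-of-the-envelope sketch (my numbers, not from the thread): in fp16, weights take 2 bytes per parameter, so memory alone is not the blocker on an 80 GB card for 13B; as far as I can tell, the constraint in the reference code is rather that --nproc_per_node has to match the checkpoint's model-parallel shard count.

# Rough fp16 weight-memory estimate: 2 bytes per parameter (activations and the
# KV cache need additional memory on top of this, so these are lower bounds).
for name, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"Llama 2 {name}: ~{n_params * 2 / 1024**3:.0f} GiB of weights in fp16")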
@EmanuelaBoros ,Thank you. It really helps!
My hardware
When running nvidia-smi this is the output:
example_chat_completion.py
The example_chat_completion.py file used below with all models is unmodified from the original repo: https://github.com/facebookresearch/llama/blob/main/example_chat_completion.py
llama-2-7b-chat
I am following README.md and successfully run the "llama-2-7b-chat" model with:
With "llama-2-7b-chat" everything works well.
llama-2-13b-chat
Now I am trying to modify the command above that runs "llama-2-7b-chat" to run "llama-2-13b-chat":
After running it, this is the output:
and after that nothing happens.
llama-2-70b-chat
Also, I am trying to run "llama-2-70b-chat":
but I am getting the following error:
Question: How do I correctly run "llama-2-13b-chat" and "llama-2-70b-chat"?