Hi, thanks for your interest in our work. When you ran the torchrun command, did you set nproc_per_node to 2?
The README lists the correct nproc_per_node value for each model size.
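For example, a minimal launch sketch, assuming the standard layout of the meta-llama/llama repo (the checkpoint directories and tokenizer path below are placeholders; adjust them to your setup). The nproc_per_node value has to match the number of checkpoint shards (MP): 1 for 7B, 2 for 13B, 8 for 70B.

```
# 7B ships as one checkpoint shard, so one process/GPU
torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model

# 13B ships as two checkpoint shards, so two processes/GPUs
torchrun --nproc_per_node 2 example_text_completion.py \
    --ckpt_dir llama-2-13b/ \
    --tokenizer_path tokenizer.model
```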
I have set nproc_per_node to 2, but then I get this error:
```
AssertionError: Loading a checkpoint for MP=0 but world size is 2
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
If MP=2, does that mean we need 2 GPUs?
Yes, as I understand it, you would need two GPUs. This has been discussed in the Llama issues as well: https://github.com/meta-llama/llama/issues/415
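If you want to confirm how many GPUs a process can actually see before launching, a quick sanity check (a minimal sketch, nothing model-specific):

```python
import torch

# torchrun starts one process per GPU (nproc_per_node), and each rank
# binds to its own device; if fewer devices are visible than ranks,
# cuda_setDevice fails with "invalid device ordinal".
print(f"Visible CUDA devices: {torch.cuda.device_count()}")
```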
Someone pointed out that using the model weights from Hugging Face allowed them to run it on a single GPU. Maybe you can try that out.
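For example, a rough sketch of the single-GPU route, assuming you have access to the gated meta-llama/Llama-2-7b-hf repo on Hugging Face (device_map="auto" needs the accelerate package installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The converted HF checkpoints are not split into MP shards,
# so a single GPU with enough memory can load them directly.
model_id = "meta-llama/Llama-2-7b-hf"  # assumes access to the gated repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit on one GPU
    device_map="auto",          # requires `pip install accelerate`
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```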
OK, thank you very much!
Thanks for your great work. I've run it successfully with LLaMA 7B. However, when I change the model size to 13B, I get this error:
```
AssertionError: Loading a checkpoint for MP=0 but world size is 1
```
Do you have any suggestions for this issue?