vcskaushik / LLMzip

How to change the model size? #3

Closed: Jone-Luo closed this issue 3 months ago

Jone-Luo commented 6 months ago

Thanks for your great work. I've run it successfully with LLaMA 7B. However, when I change the model size to 13B, I get this error:

AssertionError: Loading a checkpoint for MP=0 but world size is 1

Do you have any suggestions for this issue?

vcskaushik commented 6 months ago

Hi, thanks for your interest in our work. When you use the torchrun command, did you set nproc_per_node to 2?

The README lists the correct nproc_per_node value for each model size.
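For reference, the 13B launch should look roughly like the line below. The script name and paths are placeholders (the flags mirror Meta's llama example CLI), so check the README for the exact command:

torchrun --nproc_per_node 2 <llmzip_script>.py --ckpt_dir <path_to_weights>/13B --tokenizer_path <path_to_weights>/tokenizer.model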

Jone-Luo commented 6 months ago

I have set nproc_per_node to 2, but then I get this error:

AssertionError: Loading a checkpoint for MP=0 but world size is 2

torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Does MP=2 mean we need 2 GPUs?

vcskaushik commented 6 months ago

Yes, as I understand it, you would need two GPUs. This has been discussed in the Llama issues as well: https://github.com/meta-llama/llama/issues/415
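To sketch why: Meta's llama example derives the model-parallel size from the number of checkpoint shards in the directory, then asserts that it matches the world size set by torchrun. The snippet below paraphrases that loader; it is not LLMzip's own code, and the path is a placeholder. Incidentally, MP=0 in your first message would mean the loader found zero *.pth files, which usually indicates --ckpt_dir is not pointing at the 13B shard files.

```python
# Paraphrase of the checkpoint loading in Meta's llama example code
# (not LLMzip's own code); ckpt_dir is a placeholder path.
from pathlib import Path

ckpt_dir = "<path_to_weights>/13B"
checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
mp = len(checkpoints)  # 13B ships as 2 shards, so MP=2
world_size = 2         # set by torchrun --nproc_per_node
assert world_size == mp, (
    f"Loading a checkpoint for MP={mp} but world size is {world_size}"
)
# Each of the 2 processes then binds to its own GPU (device 0 and 1);
# with only one physical GPU, device 1 raises "invalid device ordinal".
```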

Someone pointed out that using the model weights from Hugging Face allowed them to run it on a single GPU. Maybe you can try that out.
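If you try the Hugging Face route, a minimal sketch would be something like the following; the model id is a placeholder for whichever converted LLaMA-13B checkpoint you use, and this is standard transformers usage rather than LLMzip's pipeline:

```python
# Minimal sketch, assuming LLaMA-13B weights converted to the
# Hugging Face format; the model id below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<your-converted-llama-13b>"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 halves memory vs. fp32
    device_map="auto",          # place/offload layers to fit one GPU
)
```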

Jone-Luo commented 6 months ago

OK, thank you very much!