Open DerrickYLJ opened 2 weeks ago
Hi @DerrickYLJ, in your torchrun call you need to set --nproc_per_node to your number of GPUs. It will spin up one process per GPU to split the model.
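For reference, the invocation looks roughly like the sketch below (script name and flags are assumed from this repo's example scripts; paths are placeholders for your local checkpoint):

```shell
# One process is launched per GPU; each process loads one shard of the model.
torchrun --nproc_per_node <NUM_GPUS> example_chat_completion.py \
    --ckpt_dir Meta-Llama-3-8B-Instruct/ \
    --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model \
    --max_seq_len 512 \
    --max_batch_size 4
```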
I have the same problem: when I set --nproc_per_node to 8, I get the error "AssertionError: Loading a checkpoint for MP=1 but world size is 8".
> Hi @DerrickYLJ, in your torchrun call you need to set --nproc_per_node to your number of GPUs. It will spin up one process per GPU to split the model.
Yes, I have tried that, but it produces exactly the same assertion failure reported in the other comment.
I think the problem is that Llama3-8B-Instruct has only one checkpoint file? So how would setting nproc_per_node help, or more specifically, how can we solve this?
Thank you!
Sorry @ISADORAyt, I wasn't paying attention that @DerrickYLJ was loading the 8B model. The code in this repo can only load the 8B model on a single GPU and the 70B model on 8 GPUs. To run different splits you'll need to look into a different engine like vLLM, which you can either run standalone or through TorchServe's integration: https://github.com/pytorch/serve?tab=readme-ov-file#-quick-start-llm-deployment
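For example, a minimal vLLM sketch that shards the 8B model across several GPUs (this uses the Hugging Face-format weights; adjust tensor_parallel_size to your GPU count):

```python
# Minimal sketch: tensor-parallel inference of the 8B model with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # Hugging Face-format weights
    tensor_parallel_size=4,                       # split the model across 4 GPUs
)
outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```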
> I think the problem is that Llama3-8B-Instruct has only one checkpoint file? So how would setting nproc_per_node help, or more specifically, how can we solve this?
@DerrickYLJ Please see above, I misread your initial post.
Describe the bug
I am currently building the model from source for meta-llama/Meta-Llama-3-8B-Instruct. However, only GPU 0 stores the model; all the other GPUs are empty. Assuming nothing else has been changed, I wonder how I can load this particular model on multiple GPUs, like how device_map="auto" works when loading a normal Hugging Face model (a comparison sketch of that loading path is included at the end of this issue). I have also tried accelerate.load_checkpoint_in_model, but it didn't work.

Minimal reproducible example
Output
It will load the whole model on a single GPU card.
Runtime Environment
Model: meta-llama-3-8b-instruct
Additional context
Thanks a lot!
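For comparison, the device_map="auto" behavior mentioned above is the Hugging Face transformers loading path, sketched below (this assumes the HF-format checkpoint of the same model, not the original Meta weights):

```python
# Sketch of the transformers loading path the issue compares against:
# device_map="auto" spreads the weights across all visible GPUs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # place layers on available GPUs automatically
    torch_dtype="auto",  # use the checkpoint's native dtype
)
```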