meta-llama / llama3

The official Meta Llama 3 GitHub site

Issues with torchrun --nproc_per_node num Command and Llama 3.1 Model Conflicts #322

Open shenshaowei opened 1 month ago

shenshaowei commented 1 month ago

Issue 1: The model downloaded from the Meta official website via the URL provided in download.sh comes as 8 separate model files. When I try to run example_chat_completion.py together with the llama folder provided by the official website, I find that it cannot run because I only have 2 GPUs available. Does this mean that the 70B model, which consists of 8 checkpoint files, requires 8 GPUs to run and cannot be run on a machine with only 2 GPUs? How should the code be modified to run with only two GPUs?

Issue 2: The model I downloaded from the Meta official website using the URL provided in download.sh appears to be different from the one on Hugging Face; it is the original model as described by Hugging Face. According to Hugging Face’s explanation: “This repository contains two versions of Meta-Llama-3.1-70B-Instruct, for use with transformers and with the original Llama codebase.” Therefore, do I need to download the Llama folder and use example_chat_completion.py to run it?

mylesgoose commented 2 weeks ago

Issue 1: you probably downloaded all 8 models: 8B, 8B-Instruct, 70B, 70B-Instruct, 405B, 405B-Instruct, that's 6 models. Who knows, really, what you're talking about. Let's focus on the 8B model. If you download it from Meta it has these files:

/home/myles/.llama/checkpoints/Meta-Llama3.1-8B-Instruct/checklist.chk
/home/myles/.llama/checkpoints/Meta-Llama3.1-8B-Instruct/consolidated.00.pth
/home/myles/.llama/checkpoints/Meta-Llama3.1-8B-Instruct/origparams.json
/home/myles/.llama/checkpoints/Meta-Llama3.1-8B-Instruct/params.json
/home/myles/.llama/checkpoints/Meta-Llama3.1-8B-Instruct/tokenizer.model

and to run it you would type this:

torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir /home/myles/.llama/checkpoints/Meta-Llama3.1-8B-Instruct/ \
    --tokenizer_path /home/myles/.llama/checkpoints/Meta-Llama3.1-8B-Instruct/tokenizer.model \
    --max_seq_len 128 --max_batch_size 4

However, that then runs on one 24 GB GPU. Since the 70B models are larger you need more VRAM, as they will not fit on one 24 GB GPU; maybe 8 GPUs, or two 80 GB A100 GPUs, in which case the model would be sharded over 8 GPUs or over two GPUs.

Issue 2: the model is just in a different format. One is the Hugging Face transformers format and the other is the Meta / PyTorch format. See https://github.com/meta-llama/llama-models?tab=readme-ov-file#download
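To make that concrete for issue 2: the Hugging Face model repo hosts both versions side by side. A rough sketch of pulling each one with the huggingface_hub CLI, assuming the meta-llama/Meta-Llama-3.1-70B-Instruct repo id and the usual include/exclude filters (adjust to whichever model you actually want):

# transformers-format weights (safetensors shards + config.json), for use with the transformers library
huggingface-cli download meta-llama/Meta-Llama-3.1-70B-Instruct \
    --exclude "original/*" --local-dir Meta-Llama-3.1-70B-Instruct

# original Meta-format checkpoints (consolidated.XX.pth, params.json, tokenizer.model),
# the same layout download.sh produces; this is what example_chat_completion.py expects
huggingface-cli download meta-llama/Meta-Llama-3.1-70B-Instruct \
    --include "original/*" --local-dir Meta-Llama-3.1-70B-Instruct

So you only need the llama folder and example_chat_completion.py if you want to run the original-format checkpoints with the Meta reference code; the transformers-format weights are loaded through the transformers library instead.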

#!/bin/bash

NGPUS=8
PYTHONPATH=$(git rev-parse --show-toplevel) torchrun \
    --nproc_per_node=$NGPUS \
    models/scripts/example_chat_completion.py $CHECKPOINT_DIR \
    --model_parallel_size $NGPUS

...or something along those lines.
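Tying that back to issue 1: the original 70B-Instruct download ships as eight consolidated.XX.pth shards, and as far as I know the reference code in this repo expects one torchrun process per shard. So a minimal sketch for the 70B model (placeholder paths; max_seq_len and max_batch_size are just illustrative) would be:

# one process per checkpoint shard: the original 70B download has 8 shards
torchrun --nproc_per_node 8 example_chat_completion.py \
    --ckpt_dir /path/to/Meta-Llama3.1-70B-Instruct/ \
    --tokenizer_path /path/to/Meta-Llama3.1-70B-Instruct/tokenizer.model \
    --max_seq_len 512 --max_batch_size 4

With only two GPUs you would most likely have to reshard the checkpoints or switch to the transformers-format weights; the stock example script will not spread 8 shards over 2 processes by itself.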