@shreyasp-07 The DeepSpeed config doesn't natively support a tensor_parallel
configuration. For multi-GPU inference with DeepSpeed, take a look at FastGen.
@jomayeri Can we do multi-node inference using FastGen? Also, after running the script, it is not actually splitting the model across GPUs.
Code
import mii
pipe = mii.pipeline("meta-llama/Meta-Llama-3.1-8B-Instruct")
response = pipe(["What is the weather today?"])
print(response)
Command:
deepspeed --num_gpus 4 script.py
Expected: GPU 0: 6 GB, GPU 1: 8 GB, GPU 2: 8 GB, GPU 3: 8 GB
Observed: GPU 0: 23 GB, GPU 1: 23 GB, GPU 2: 23 GB, GPU 3: 23 GB
What am I missing?
Why are those your expected numbers? Are you constraining KV-cache size in some way?
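For reference, here is a rough sketch of one way to bound it, assuming your DeepSpeed-MII version accepts the max_length option of ModelConfig (it caps the maximum sequence length; whether it also shrinks the up-front KV-cache reservation depends on the MII release):

import mii

# Hypothetical variation of the script above: cap the maximum sequence length.
# max_length is a ModelConfig field in recent DeepSpeed-MII releases; its exact
# effect on the pre-allocated KV-cache blocks may vary by version.
pipe = mii.pipeline(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_length=2048,
)
response = pipe(["What is the weather today?"], max_new_tokens=128)
print(response)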
Here is an example of my run:
import mii
pipe = mii.pipeline("meta-llama/Meta-Llama-3-70B")
response = pipe(["What is the weather today?"], max_new_tokens=128)
print(response)
deepspeed --num_gpus 2 mii_inferenece.py
When the model is built and split across the GPUs, the nvidia-smi output looks like this:
The model has 70B parameters * 2 bytes per parameter = ~140 GB of weights; split across 2 GPUs, that is ~70 GB per GPU.
Once the model begins generating a response, more memory is used by the generation artifacts (KV-cache).
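To make the arithmetic concrete for both models (a back-of-the-envelope sketch; real usage also includes activations, the CUDA context, and the KV-cache blocks FastGen sets aside for generation):

# Rough per-GPU weight footprint under tensor parallelism (fp16/bf16 = 2 bytes/param).
bytes_per_param = 2
print(70e9 * bytes_per_param / 2 / 1e9)  # 70B model on 2 GPUs -> ~70 GB of weights per GPU
print(8e9 * bytes_per_param / 4 / 1e9)   # 8B model on 4 GPUs  -> ~4 GB of weights per GPU
# The gap between ~4 GB of weights and the ~23 GB observed per GPU is
# largely memory reserved for generation (KV-cache and related buffers).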
I have 6 GPUs on server-1, each with 24 GB, and 4 GPUs on server-2, each with 24 GB. How can I load Llama 3.1 70B across both servers?
If you have the multi-node environment set up, you can use the deepspeed launcher in the same way.
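For example, with a hostfile and passwordless SSH between the nodes (hostnames here are hypothetical; each line is "<host> slots=<num_gpus>", and script.py is the MII script from above):

# hostfile
server-1 slots=6
server-2 slots=4

Command:
deepspeed --hostfile=hostfile script.py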
I have two servers configured as follows: Server-1 with 6 GPUs and Server-2 with 4 GPUs, each GPU having 24GB of VRAM. I'm attempting to load the LLaMA-3.1-70B model across both servers using DeepSpeed in conjunction with the Transformers pipeline. However, the model is not loading as expected; instead of distributing the model across the GPUs, it is consuming the full memory on all 10 GPUs across both servers.
Below is the code I am using for inference, but I'm uncertain whether the configuration is correct or whether this is the proper approach.
Could someone assist with resolving memory issues and configuring inference across multiple nodes with multiple GPUs?