microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

How to do Inference on Multi Node with Multi GPUs using deepspeed? #6483

Closed · shreyasp-07 closed this issue 1 month ago

shreyasp-07 commented 1 month ago

I have two servers configured as follows: Server-1 with 6 GPUs and Server-2 with 4 GPUs, each GPU with 24 GB of VRAM. I'm attempting to load the LLaMA-3.1-70B model across both servers using DeepSpeed together with the Transformers pipeline. However, the model does not load as expected: instead of being sharded across the GPUs, it consumes the full memory of all 10 GPUs on both servers.

Below is the code I am using, but I'm uncertain whether the configuration is correct or whether this is the proper approach for inference.

# Imports assumed for this snippet; `model` is assumed to be a Hugging Face
# model loaded elsewhere in the script.
import deepspeed
from transformers.integrations import HfDeepSpeedConfig

ds_config = {
    "train_batch_size": 1,
    "fp16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 3,  # Using stage 3 to enable full sharding
        "offload_param": {
            "device": "none",  # Keep parameters on GPU
            "pin_memory": False  # No need to pin memory if not using CPU offload
        },
        "offload_optimizer": {
            "device": "none",  # Keep optimizer states on GPU
            "pin_memory": False  # No need to pin memory if not using CPU offload
        },
        "contiguous_gradients": True,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 1e8,
        "allgather_bucket_size": 1e8
    },
    "tensor_parallel": {
        "tp_size": 10  # Total number of GPUs across both servers
    },
    "zero_allow_untested_optimizer": True,
    "steps_per_print": 2000,
    "wall_clock_breakdown": False
}

# Initialize HfDeepSpeedConfig using the config dictionary
ds_config_obj = HfDeepSpeedConfig(ds_config)

# Initialize DeepSpeed with the model
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config_params=ds_config,
    model_parameters=model.parameters(),
    dist_init_required=True  # Ensures initialization for distributed training
)

Could someone assist with resolving memory issues and configuring inference across multiple nodes with multiple GPUs?

jomayeri commented 1 month ago

@shreyasp-07 The DeepSpeed config doesn't natively support a tensor_parallel section. For multi-GPU inference with DeepSpeed, take a look at FastGen.

shreyasp-07 commented 1 month ago

@jomayeri Can we do multi-node inference using FastGen? Also, after running the script below, it is not actually splitting the model across GPUs.

Code

import mii
pipe = mii.pipeline("meta-llama/Meta-Llama-3.1-8B-Instruct")
response = pipe(["What is the weather today?"])
print(response)

Command:

deepspeed --num_gpus 4 script.py

Expected: GPU 0: 6 GB, GPU 1: 8 GB, GPU 2: 8 GB, GPU 3: 8 GB
Observed: GPU 0: 23 GB, GPU 1: 23 GB, GPU 2: 23 GB, GPU 3: 23 GB

What am I missing?

jomayeri commented 1 month ago

Why are those your expected numbers? Are you constraining KV-cache size in some way?

Here is an example of my run:

import mii
pipe = mii.pipeline("meta-llama/Meta-Llama-3-70B")
response = pipe(["What is the weather today?"], max_new_tokens=128)
print(response)

deepspeed --num_gpus 2 mii_inference.py

When the model is built and split across GPUs, the nvidia-smi output looks like this: [screenshot of nvidia-smi]

The model is 70B parameters * 2 bytes per parameter = ~140 GB of weights; split across 2 GPUs, that is ~70 GB per GPU.

Once the model begins generating a response, more memory will be used by the generation artifacts (KV cache): [screenshot of nvidia-smi during generation]
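A rough sketch of that arithmetic (the helper below is hypothetical and counts only fp16 weights, ignoring KV cache and other runtime overhead; the parameter and GPU counts are the ones discussed in this thread):

def fp16_weight_gb_per_gpu(num_params_billion, num_gpus):
    # 2 bytes per parameter in fp16, split evenly across GPUs
    return num_params_billion * 2 / num_gpus

print(fp16_weight_gb_per_gpu(70, 2))   # ~70 GB per GPU, as in the example above
print(fp16_weight_gb_per_gpu(70, 10))  # ~14 GB per GPU for weights alone on a 10-GPU setup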

shreyasp-07 commented 1 month ago

I have 6 GPUs on server-1, each with 24 GB, and 4 GPUs on server-2, each with 24 GB. How can I load Llama 3.1 70B across both servers?

jomayeri commented 1 month ago

If you have the multi-node environment set up, you can use the deepspeed launcher in the same way.
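For illustration, a minimal multi-node launch could look like the sketch below (the hostnames are placeholders; passwordless SSH between the nodes and an identical environment on each node are assumed):

hostfile:

server1 slots=6
server2 slots=4

Command:

deepspeed --hostfile=hostfile script.py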