microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Excessive CPU and GPU Memory Usage with Multi-GPU Inference Using DeepSpeed #5793

Open gawain000000 opened 1 month ago

gawain000000 commented 1 month ago

I am experiencing excessive CPU and GPU memory usage when running multi-GPU inference with DeepSpeed. Specifically, the memory usage does not scale as expected when increasing the number of GPUs. Below is the code I am using for inference:

import os
import torch
import deepspeed
import time
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
from deepspeed.runtime.zero.config import DeepSpeedZeroConfig
from deepspeed.inference.config import DeepSpeedTPConfig
from deepspeed.runtime.utils import see_memory_usage

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

model_dir = "/mnt/sgnfsdata/tolo-03-97/pretrained_models/internlm2-chat-20b"
trust_remote_code = True
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=trust_remote_code)
config = AutoConfig.from_pretrained(model_dir, trust_remote_code=trust_remote_code)
model = AutoModelForCausalLM.from_pretrained(model_dir,
                                             torch_dtype=torch.bfloat16,
                                             trust_remote_code=trust_remote_code
                                             )

model = model.eval()
see_memory_usage("After load model", force=True)

tp_config = DeepSpeedTPConfig(tp_size=world_size)
zero_config = DeepSpeedZeroConfig(stage=3,
                                  model_persistence_threshold=0,
                                  max_live_parameters=0,
                                  mics_shard_size=world_size
                                  )
ds_engine = deepspeed.init_inference(model=model,
                                     tensor_parallel=tp_config,
                                     dtype=torch.bfloat16,
                                     zero=zero_config,
                                     max_out_tokens=1024,
                                     replace_method="auto",
                                     replace_with_kernel_inject=True)

see_memory_usage("After DS-inference init", force=True)

model = ds_engine.module
print("device: ", model.device)
prompt = "what is deepspeed?"
t0 = time.time()
response = model.chat(tokenizer=tokenizer,
                      query=prompt,
                      history=[],
                      max_new_tokens=1024,
                      do_sample=True,
                      temperature=0.8,
                      top_p=0.8
                      )
t1 = time.time()
print(response)
print('=' * 100)
print("inference time: ", t1 - t0)
print('=' * 100)

Steps to Reproduce:

  1. Run the script with 2 GPUs:

    deepspeed --num_gpus 2 main.py --ds_inference

    (screenshots attached in the original issue)

  2. Run the script with 4 GPUs:

    deepspeed --num_gpus 4 main.py --ds_inference

    (screenshots attached in the original issue)

Expected Behavior: I expected that using 4 GPUs would reduce the memory usage per GPU, ideally halving the GPU memory usage compared to running with 2 GPUs.

Actual Behavior:

With 2 GPUs:
    CPU virtual memory: 92.87GB
    Each GPU memory: 37.74GB

With 4 GPUs:
    CPU virtual memory: 162.92GB (significantly higher than expected)
    Each GPU memory: 37.74GB (no reduction)

Questions:

Why does the CPU virtual memory usage increase significantly when using more GPUs?
How can I reduce the memory usage per GPU when scaling up the number of GPUs?

System Info:

DeepSpeed version: 0.14.4
PyTorch version: 2.3.1
Transformers version: 4.42.3
Python version: 3.10
OS: Ubuntu 24.04

Additional Context: Any insights or suggestions on how to optimize the memory usage for multi-GPU inference with DeepSpeed would be greatly appreciated. Thank you!

tjruwase commented 1 month ago

@gawain000000, can you clarify your goals? There are two different solutions, one for latency/throughput scenarios and one for low-budget scenarios. I noticed that your code combines deepspeed.init_inference with ZeRO stage 3, which is not a recommended combination.

  1. ZeRO-Inference for low-budget throughput scenarios: built on ZeRO stage 3 and enabled via deepspeed.initialize() (a minimal sketch follows below). You can find examples here and here.

  2. FastGen for latency/throughput scenarios: independent of ZeRO stage 3. You can find the doc here.
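
For reference, a minimal sketch of option 1 (ZeRO-Inference). It is not taken from the linked examples; the config values, the HfDeepSpeedConfig usage, and the generation call are illustrative assumptions that would need to be adapted to the actual checkpoint and hardware. Loading the model while an HfDeepSpeedConfig is alive lets transformers shard the weights into ZeRO-3 partitions during from_pretrained, so each rank does not hold a full bf16 copy of the 20B model in CPU RAM, which is one likely source of the CPU virtual memory growth reported above.

import os
import torch
import deepspeed
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.integrations import HfDeepSpeedConfig

local_rank = int(os.getenv("LOCAL_RANK", "0"))
model_dir = "/mnt/sgnfsdata/tolo-03-97/pretrained_models/internlm2-chat-20b"

# Illustrative ZeRO stage 3 inference config; offload and buffer sizes should
# be tuned for the actual hardware.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
}

# Keep this object alive before from_pretrained so the checkpoint is loaded
# directly into ZeRO-3 partitions instead of a full per-rank CPU copy.
dschf = HfDeepSpeedConfig(ds_config)

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir,
                                             torch_dtype=torch.bfloat16,
                                             trust_remote_code=True)
model.eval()

# ZeRO-Inference is driven by deepspeed.initialize(), not init_inference().
ds_engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
ds_engine.module.eval()

inputs = tokenizer("what is deepspeed?", return_tensors="pt").to(f"cuda:{local_rank}")
with torch.no_grad():
    # synced_gpus keeps all ranks stepping together while ZeRO-3 gathers
    # parameters layer by layer during generation.
    outputs = ds_engine.module.generate(**inputs, max_new_tokens=256, synced_gpus=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Launched the same way (deepspeed --num_gpus 4 main.py), this setup should partition the bf16 weights across the four ranks instead of replicating them; the trade-off is throughput-oriented rather than latency-oriented.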

gawain000000 commented 1 month ago

@tjruwase My goals are the following:

  1. Reduce the latency and increase the throughput of inference, which is why we want to use DeepSpeed.
  2. Slice the model across multiple GPUs so that each GPU needs a smaller amount of memory.

The reason is that when an LLM runs inference over a long document, it needs additional memory for the KV cache. I currently deploy the LLM on L40S GPUs, each with only 46GB of memory. Without model slicing, processing a document of around 7,000 tokens results in an OOM error. I do not understand why DeepSpeed's inference initialization allocates the same amount of GPU memory in both cases, 2 GPUs and 4 GPUs, which makes it impossible to process long documents.
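
Given these goals (low latency plus sharding the weights and KV cache across GPUs), option 2 above (FastGen via DeepSpeed-MII) may be the closer fit, provided internlm2 is among the architectures FastGen supports, which is not confirmed here. A minimal, illustrative sketch, reusing the same checkpoint path and with assumed generation settings:

# Hedged sketch of option 2 (FastGen / DeepSpeed-MII); the model path is reused
# from above and the generation parameters are assumptions.
import mii

model_dir = "/mnt/sgnfsdata/tolo-03-97/pretrained_models/internlm2-chat-20b"

# The non-persistent pipeline shards the model across the ranks started by the
# deepspeed launcher, e.g.: deepspeed --num_gpus 4 <script>.py
pipe = mii.pipeline(model_dir)
response = pipe(["what is deepspeed?"], max_new_tokens=1024)
print(response)

The FastGen docs also describe a persistent mii.serve() deployment with a tensor_parallel argument; the exact arguments should be checked there, since none of this is taken from the thread.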