microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Zero cpu offloading is not working #3764

Open devkaranjoshi opened 1 year ago

devkaranjoshi commented 1 year ago

I want to use DeepSpeed for inference, but I am not able to load the model correctly with DeepSpeed. As per my understanding of the theory, DeepSpeed should load all the model weights on the CPU or NVMe. But whenever I run this script (attached with this message), all the model weights are first loaded on the CPU, then transferred straight to the GPU, and it runs out of CUDA memory. The command I am using to run the code below:

RUN: deepspeed --num_gpus 1 deepspeed_test.py

System requirements: I am using a 24 GB GPU and a CPU with 100 GB of RAM. Model: llama-13b

This is the code I am using:

```python
import os

import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

generator = pipeline('text-generation', model='model_path/llama-13b/')

generator.model = deepspeed.init_inference(
    generator.model,
    mp_size=world_size,
    dtype=torch.half,
    replace_method='auto',
    replace_with_kernel_inject=True,
)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
```

Please help me out with this, as you are well aware of the functionality.

tjruwase commented 1 year ago

@devkaranjoshi, offloading for inference (a.k.a. ZeRO-Inference) is enabled using the deepspeed.initialize() API, not deepspeed.init_inference(). Please see an example for bloom-176b inference here.
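
For reference, here is a minimal sketch of that offloading pattern; the model path and config values are illustrative assumptions, not the exact contents of the linked example:

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM
from transformers.deepspeed import HfDeepSpeedConfig

model_path = "model_path/llama-13b/"  # illustrative path

# ZeRO stage 3 with parameters offloaded to CPU
# (use "nvme" plus an nvme_path for NVMe offload)
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
}

# Create HfDeepSpeedConfig before from_pretrained so the weights are partitioned
# at load time instead of being materialized in full on one device.
dschf = HfDeepSpeedConfig(ds_config)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)

# Note: deepspeed.initialize(), not deepspeed.init_inference()
ds_engine = deepspeed.initialize(model=model, config=ds_config)[0]
ds_engine.module.eval()
```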

devkaranjoshi commented 1 year ago

@tjruwase Thanks, but I have a few questions:

  1. Why is model inference so slow when using DeepSpeed?
  2. If I have to use deepspeed.initialize(), then what is the purpose of deepspeed.init_inference()? Please answer my questions.
  3. Also, there is no reduction in GPU memory consumption despite CPU offloading.

tjruwase commented 1 year ago

@devkaranjoshi, please see this paper and doc for a description of the DeepSpeed inference solutions.

  1. Can you elaborate on what you mean by "too slow"?
  2. deepspeed.initialize() is meant for inference democratization, i.e., supporting large models on just a few GPUs. This is more useful for throughput-oriented inference. deepspeed.init_inference() is meant for fast, latency-critical inference and requires enough GPUs to fit the model. (A short sketch contrasting the two calls follows at the end of this comment.)
  3. Can you please elaborate on the GPU consumption of CPU offloading?

In general, it would be useful for you to open new tickets with repro steps and logs to help investigate your questions. Thanks!
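
As a summary of point 2, a hypothetical helper contrasting the two entry points is sketched below; build_engine, latency_critical, and zero3_offload_config are made-up names for illustration:

```python
import deepspeed
import torch

def build_engine(model, world_size, zero3_offload_config, latency_critical):
    """Illustrative only: choose between the two DeepSpeed inference entry points."""
    if latency_critical:
        # Model fits in GPU memory: kernel-injected, latency-optimized inference.
        return deepspeed.init_inference(
            model, mp_size=world_size, dtype=torch.half,
            replace_with_kernel_inject=True)
    # Model does not fit (or the workload is throughput-bound): ZeRO stage 3 with
    # parameter offloading, driven through deepspeed.initialize().
    return deepspeed.initialize(model=model, config=zero3_offload_config)[0]
```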

devkaranjoshi commented 1 year ago

@tjruwase

Using the Vicuna 7B model on a 25 GB GPU.

  1. "Too slow" means token generation is very slow during prediction when using DeepSpeed CPU offloading. Time: 5.25 min; GPU consumption: 6389 MB; tokens generated: 1102

```python
import os

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.deepspeed import HfDeepSpeedConfig

os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed("nccl")

# DeepSpeed config: ZeRO stage 3 with parameters offloaded to CPU
config = {
    "fp16": {"enabled": True},
    "bf16": {"enabled": False},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "allgather_partitions": True,
        "contiguous_gradients": True,
        "allgather_bucket_size": 2e8,
        "reduce_scatter": True,
        "reduce_bucket_size": 2e8,
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": "auto",
        "stage3_max_reuse_distance": "auto",
    },
    "steps_per_print": 2000,
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False,
}

# `path` is the local model directory (elided here)
l_model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16)

dschf = HfDeepSpeedConfig(config)
ds_engine = deepspeed.initialize(model=l_model, config_params=config)[0]
ds_engine.module.eval()
model = ds_engine.module

tokenizer = AutoTokenizer.from_pretrained(path, padding=True)
tokenizer.pad_token = "[PAD]"


def generate():
    """returns a list of zipped inputs, outputs and number of new tokens"""
    inputs = [
        "DeepSpeed is a machine learning framework",
        "He is working on",
        "He has a",
        "He got all",
        "Everyone is happy and I can",
        "The new movie that got Oscar this year",
        "In the far far distance from our galaxy,",
        "Peace is the only way",
    ]

    generate_kwargs = dict(max_new_tokens=200, do_sample=False)

    input_tokens = tokenizer.batch_encode_plus(inputs, return_tensors="pt", padding=True)
    for t in input_tokens:
        if torch.is_tensor(input_tokens[t]):
            input_tokens[t] = input_tokens[t].to("cuda:0")

    outputs = l_model.generate(**input_tokens, **generate_kwargs)

    input_tokens_lengths = [x.shape[0] for x in input_tokens.input_ids]
    output_tokens_lengths = [x.shape[0] for x in outputs]

    total_new_tokens = [o - i for i, o in zip(input_tokens_lengths, output_tokens_lengths)]
    outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return outputs
```

Please correct me if I am wrong.

tjruwase commented 1 year ago

Thanks for providing more details. Yes, token generation with offloading suffers from high latency due to PCIe bandwidth limitations. Thus, offloading-based generation is targeted at scenarios where there is insufficient GPU memory to fit the model, or that are throughput-bound. Since a 7B model fits into a 25 GB GPU, it is better to use the latency-optimized inference option, deepspeed.init_inference().
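
For the Vicuna-7B case above, a minimal latency-optimized sketch along those lines might look like the following; the model path is a placeholder and the call mirrors the pattern from the first post rather than a verified script:

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "model_path/vicuna-7b/"  # placeholder model directory

model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(path)

# Latency-optimized engine: the full fp16 model (roughly 13-14 GB for 7B parameters)
# stays on the GPU, so there is no parameter traffic over PCIe during generation.
model = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.half,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```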