devkaranjoshi opened this issue 1 year ago
@devkaranjoshi, offloading for inference (a.k.a. ZeRO-Inference) is enabled using the deepspeed.initialize() API, not deepspeed.init_inference(). Please see an example of bloom-176b inference here.
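For reference, a minimal ZeRO-Inference sketch along those lines (this is not the actual bloom-176b example; the model name and config values below are illustrative placeholders):

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.deepspeed import HfDeepSpeedConfig

model_name = "facebook/opt-125m"  # stand-in checkpoint; any causal LM works

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
    "train_batch_size": 1,
}

# Must be constructed before from_pretrained() so transformers enables ZeRO-3 loading;
# keep the object alive for the lifetime of the model.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
engine.module.eval()

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(engine.module.generate(**inputs, max_new_tokens=20)[0]))

This assumes the script is started with the DeepSpeed launcher (e.g. deepspeed --num_gpus 1 script.py) so the distributed environment variables are set.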
@tjruwase Thanks, but I have a few questions:
@devkaranjoshi, please see this paper and doc for a description of the DeepSpeed inference solutions. deepspeed.initialize() is meant for inference democratization, i.e., supporting large models on a few GPUs; it is more useful for throughput-oriented inference. deepspeed.init_inference() is meant for fast, latency-critical inference and requires enough GPUs to fit the model. In general, it would be useful for you to open new tickets with repro steps and logs to help investigate your questions. Thanks!
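As a rough illustration of that split, here is a hypothetical back-of-the-envelope check (not a DeepSpeed API) for deciding which path applies:

import torch

def fits_on_gpu(num_params: int, headroom: float = 1.3) -> bool:
    """Rough check: fp16 weights (2 bytes/param) plus headroom for activations/KV cache."""
    free_bytes, _ = torch.cuda.mem_get_info()
    return num_params * 2 * headroom < free_bytes

# ~7B params on a 25GB GPU fits, so deepspeed.init_inference() (latency-optimized) applies;
# ~176B params does not, so deepspeed.initialize() with ZeRO-3 offload (throughput-oriented) applies.
print(fits_on_gpu(7_000_000_000), fits_on_gpu(176_000_000_000))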
@tjruwase
Using the Vicuna 7B model on a 25GB GPU.
import os

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.deepspeed import HfDeepSpeedConfig

os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed("nccl")

config = {
    "fp16": {"enabled": True},
    "bf16": {"enabled": False},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "allgather_partitions": True,
        "contiguous_gradients": True,
        "allgather_bucket_size": 2e8,
        "reduce_scatter": True,
        "reduce_bucket_size": 2e8,
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": "auto",
        "stage3_max_reuse_distance": "auto",
    },
    "steps_per_print": 2000,
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False,
}

# `path` points to the local Vicuna-7B checkpoint.
l_model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16)

# Keep a reference to HfDeepSpeedConfig so transformers knows ZeRO-3 is active.
dschf = HfDeepSpeedConfig(config)
ds_engine = deepspeed.initialize(model=l_model, config_params=config)[0]
ds_engine.module.eval()
model = ds_engine.module

tokenizer = AutoTokenizer.from_pretrained(path, padding=True)
tokenizer.pad_token = "[PAD]"
def generate():
"""returns a list of zipped inputs, outputs and number of new tokens"""
inputs= [
"DeepSpeed is a machine learning framework",
"He is working on",
"He has a",
"He got all",
"Everyone is happy and I can",
"The new movie that got Oscar this year",
"In the far far distance from our galaxy,",
"Peace is the only way",
]
generate_kwargs = dict(max_new_tokens=200, do_sample=False)
input_tokens = tokenizer.batch_encode_plus(inputs, return_tensors="pt", padding=True)
for t in input_tokens:
if torch.is_tensor(input_tokens[t]):
input_tokens[t] = input_tokens[t].to("cuda:0")
    # Generate through the DeepSpeed-wrapped module (ds_engine.module).
    outputs = model.generate(**input_tokens, **generate_kwargs)
input_tokens_lengths = [x.shape[0] for x in input_tokens.input_ids]
output_tokens_lengths = [x.shape[0] for x in outputs]
total_new_tokens = [o - i for i, o in zip(input_tokens_lengths, output_tokens_lengths)]
outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
return outputs
Please correct me if I am wrong.
Thanks for providing more details. Yes, token generation with offloading suffers from high latency due to PCIe bandwidth limitations, so offloading-based generation is targeted at scenarios where GPU memory is insufficient to fit the model, or that are throughput-bound. Since a 7B model fits into a 25GB GPU, it is better to use the latency-optimized inference option, deepspeed.init_inference().
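For this case, a minimal latency-optimized sketch might look like the following; the checkpoint path and generation parameters are placeholders, and the script is assumed to be launched with the DeepSpeed launcher:

import os

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "model_path/vicuna-7b/"  # placeholder checkpoint path
world_size = int(os.getenv("WORLD_SIZE", "1"))

model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(path)

# Latency-optimized engine: the fp16 model is held entirely in GPU memory.
engine = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.half,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("DeepSpeed is a machine learning framework", return_tensors="pt").to("cuda:0")
outputs = engine.module.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))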
I want to use DeepSpeed for inference, but I am not able to correctly load the model using DeepSpeed. As per my understanding, DeepSpeed should load all the model weights on CPU or NVMe. But whenever I run this script (attached with this message), all the model weights are first loaded on the CPU, then transferred straight to the GPU, and it runs CUDA out of memory. The command I am using to run the code is below:
RUN: deepspeed --num_gpus 1 deepspeed_test.py
System requirements: I am using a 24GB GPU and a CPU with 100GB RAM. Model: llama-13b
This is the code I am using:
import os

import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# Build a standard Hugging Face text-generation pipeline from the local llama-13b checkpoint.
generator = pipeline('text-generation', model='model_path/llama-13b/')

# Wrap the model with the latency-optimized inference engine (kernel injection).
generator.model = deepspeed.init_inference(
    generator.model,
    mp_size=world_size,
    dtype=torch.half,
    replace_method='auto',
    replace_with_kernel_inject=True,
)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
Please help me out with this, as you are well aware of the functionality.