microsoft / DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
Apache License 2.0

Benchmarking MII performance #204

Open Mutinifni opened 1 year ago

Mutinifni commented 1 year ago

Hello,

I'm trying to benchmark inference performance of various LLMs using MII.

I load models using:

import mii
mii_configs = {"tensor_parallel": 2, "dtype": "fp16", "max_tokens": 1500, "load_with_sys_mem": True}
mii.deploy(task="text-generation", model="EleutherAI/gpt-neox-20b",
           model_path="/home/azureuser/mii_models", deployment_name="mdl",
           mii_config=mii_configs)

And my benchmark script looks like this:

import mii
import time

batch_size = 32
prompts = ["Seattle is"] * batch_size  # identical short prompts for a consistent workload
generator = mii.mii_query_handle("mdl")

# Measure end-to-end client latency for 100 identical batched requests.
times = []
for _ in range(100):
    start = time.time()
    result = generator.query({"query": prompts}, do_sample=True,
                             min_new_tokens=100, max_new_tokens=100)
    end = time.time()
    times.append(end - start)

Note that I'm reusing a small input and generating a fixed number of tokens to keep the measurements consistent. This may not be the best way to go about it; if so, please let me know!
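In case it matters, here is roughly how I summarize those timings afterwards (assuming every request returns exactly batch_size * 100 new tokens, since min_new_tokens == max_new_tokens):

import statistics

# Summary statistics over the measured wall-clock times.
mean_latency = statistics.mean(times)
p99_latency = sorted(times)[int(0.99 * len(times)) - 1]

# Rough throughput estimate: each request generates batch_size * 100 new tokens.
tokens_per_request = batch_size * 100
print(f"mean latency: {mean_latency:.3f}s, p99 latency: {p99_latency:.3f}s")
print(f"approx. throughput: {tokens_per_request / mean_latency:.1f} tokens/sec")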

When running the above, my MII model server displays the following warning after a few inferences:

/home/azureuser/miniconda3/envs/gptneox/lib/python3.8/site-packages/transformers/pipelines/base.py:1080: UserWarning: You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
  warnings.warn(

I have two questions:

  1. Should I be running my benchmark differently to avoid the warning above? I couldn't find any documentation on using Hugging Face transformers pipelines directly for MII inference (I believe they are just used under the hood?).
  2. What is the best way to maximize GPU utilization / throughput for a given batch size? I tried sending requests concurrently using multiple instances of the above benchmarking script (see the sketch after this list), but GPU utilization, monitored with nvidia-smi, stays the same while latency approximately doubles.
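For reference, by "sending requests concurrently" I mean roughly the following; this is a threaded sketch of the multi-process setup I actually ran, assuming one query handle per thread is fine:

import time
from concurrent.futures import ThreadPoolExecutor

import mii

NUM_CLIENTS = 4
batch_size = 32
prompts = ["Seattle is"] * batch_size

def client_loop(n_requests=25):
    # One query handle per client thread, in case the handle is not thread-safe.
    generator = mii.mii_query_handle("mdl")
    latencies = []
    for _ in range(n_requests):
        start = time.time()
        generator.query({"query": prompts}, do_sample=True,
                        min_new_tokens=100, max_new_tokens=100)
        latencies.append(time.time() - start)
    return latencies

with ThreadPoolExecutor(max_workers=NUM_CLIENTS) as pool:
    all_latencies = list(pool.map(lambda _: client_loop(), range(NUM_CLIENTS)))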

Any pointers would be appreciated -- thanks!

mrwyattii commented 1 year ago
  1. The way you are running the benchmarks closely matches how we do so internally. You are correct: transformers.pipeline is used under the hood to load the tokenizer, model, etc. I believe you are seeing this warning because you pass multiple samples as the input; under the hood, each sample is essentially fed into the model, calling generate one at a time. You may be able to get around this by providing the batch_size parameter to generator.query (it will be passed through to the transformers.pipeline object by MII; see the sketch after this list), although I have not tested this recently. We avoid this problem by using a single sample when we run benchmarks (i.e., batch size 1) and instead generating a large number of tokens, which lets us measure per-token latency.
  2. It's hard to say what the exact settings should be if your goal is to maximize GPU utilization for a sustained period. Adding the batch_size param would be a good start; increasing the number of generated tokens and using larger models are other obvious ways to increase utilization. What GPUs are you using for these benchmarks? I can do some testing on my side to help elaborate more!
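For point 1, the (untested) client-side change would be something like this, with batch_size forwarded to the underlying transformers.pipeline call:

import mii

generator = mii.mii_query_handle("mdl")

# Untested sketch: extra kwargs such as batch_size should be forwarded by MII
# to the underlying transformers.pipeline call on the server side.
result = generator.query(
    {"query": ["Seattle is"] * 32},
    do_sample=True,
    max_new_tokens=100,
    batch_size=32,
)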

Another thing that may help: We have timers built into MII and DeepSpeed that you can utilize to extend your results. In particular, result.time_taken will measure the server-side time (see here for implementation details: https://github.com/microsoft/DeepSpeed-MII/blob/dc5ab44dfa48ae9f0a99b356e96c8849c0c78aea/mii/grpc_related/modelresponse_server.py#L85) and result.model_time_taken will measure the forward pass time in DeepSpeed (see here for implementation details: https://github.com/microsoft/DeepSpeed/blob/4cd0a003f5b6744a3455c34ad0d20364a8627b30/deepspeed/inference/engine.py#L218 and https://github.com/microsoft/DeepSpeed-MII/blob/dc5ab44dfa48ae9f0a99b356e96c8849c0c78aea/mii/grpc_related/modelresponse_server.py#L47).
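For example, something along these lines records the server-side and model-side timings alongside the client-side latency:

import time

import mii

generator = mii.mii_query_handle("mdl")

client_times, server_times, model_times = [], [], []
for _ in range(100):
    start = time.time()
    result = generator.query({"query": ["Seattle is"]}, do_sample=True,
                             min_new_tokens=100, max_new_tokens=100)
    client_times.append(time.time() - start)     # end-to-end latency at the client
    server_times.append(result.time_taken)       # server-side time measured by MII
    model_times.append(result.model_time_taken)  # DeepSpeed forward-pass time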

We are also looking into adding our benchmark code into the MII repository. I will keep you updated on any progress here. Thanks

Mutinifni commented 1 year ago

Thank you for all the pointers!

  1. I did try passing batch_size to generator.query before; however, it results in this error (for GPT-NeoX-20B):

    Exception calling application: Pipeline with tokenizer without pad_token cannot do batching. You can try to set it with `pipe.tokenizer.pad_token_id = model.config.eos_token_id`.

    Is this something MII allows setting without modifying the internal pipeline source? (A sketch of the suggested workaround, outside MII, follows this list.)

    EDIT: Setting batch_size for OPT yields:

    Exception calling application: The specified pointer resides on host memory and is not registered with any CUDA device

  2. I'm using 8x A100 40GB GPUs. I did try increasing the number of generated tokens before as well, but that simply increased the overall request latency rather than GPU utilization. Larger models do generally help.
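For reference, outside of MII the workaround the error message suggests would look roughly like this on a plain transformers pipeline; I don't see how to do the equivalent through MII's config:

from transformers import pipeline

# Plain transformers example of the fix suggested by the error message;
# the question is whether MII exposes an equivalent setting.
pipe = pipeline("text-generation", model="EleutherAI/gpt-neox-20b", device=0)
pipe.tokenizer.pad_token_id = pipe.model.config.eos_token_id
outputs = pipe(["Seattle is"] * 8, do_sample=True, max_new_tokens=100, batch_size=8)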

Am I correct in understanding that there is no support for concurrent processing of inference requests? Would multiple model instances have to be loaded on the GPU for that to work?

muhammad-asn commented 1 year ago

How can I apply concurrent processing for inference requests? I'd be glad to hear your response. Thank you.