Mutinifni opened this issue 1 year ago
`transformers.pipeline` is used under the hood to load the tokenizer, model, etc. I believe the reason you are seeing this warning is that you are using multiple samples for the input. Under the hood, this will essentially feed each sample into the model, calling `generate` one at a time. You may be able to get around this by providing the `batch_size` parameter to `generator.query` (it will be passed to the `transformers.pipeline` object by MII); however, I have not tested this recently. We avoid this problem by using a single sample when we run benchmarks (i.e., batch size 1) and instead generate a large number of tokens, allowing us to measure the per-token latency.
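A minimal sketch of what passing `batch_size` could look like (assuming the `mii.mii_query_handle` client; the deployment name, prompts, and generation kwargs are illustrative, and as noted this path is not recently tested):

```python
import mii

# Connect to an already-running MII deployment (name is a placeholder).
generator = mii.mii_query_handle("text-gen-deployment")

prompts = {"query": ["DeepSpeed is", "Seattle is", "Paris is", "Tokyo is"]}

# batch_size should be forwarded to the underlying transformers.pipeline call,
# so the four prompts can go through generate together rather than one at a time.
result = generator.query(prompts, batch_size=4, max_new_tokens=64)
print(result)
```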
The `batch_size` param would be a good start. Increasing the number of tokens and using larger models would also be some obvious ways to increase utilization. What GPU are you using for these benchmarks? I can do some testing on my side to help elaborate more!

Another thing that may help:
We have timers built into MII and DeepSpeed that you can utilize to extend your results. In particular, `result.time_taken` will measure the server-side time (see here for implementation details: https://github.com/microsoft/DeepSpeed-MII/blob/dc5ab44dfa48ae9f0a99b356e96c8849c0c78aea/mii/grpc_related/modelresponse_server.py#L85) and `result.model_time_taken` will measure the forward-pass time in DeepSpeed (see here for implementation details: https://github.com/microsoft/DeepSpeed/blob/4cd0a003f5b6744a3455c34ad0d20364a8627b30/deepspeed/inference/engine.py#L218 and https://github.com/microsoft/DeepSpeed-MII/blob/dc5ab44dfa48ae9f0a99b356e96c8849c0c78aea/mii/grpc_related/modelresponse_server.py#L47).
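For example, a rough sketch of folding those timers into a benchmark loop (same placeholder deployment name as above; the attribute names come from the response object linked above):

```python
import mii

generator = mii.mii_query_handle("text-gen-deployment")  # placeholder name

server_times, model_times = [], []
for _ in range(10):
    result = generator.query({"query": ["DeepSpeed is"]}, max_new_tokens=128)
    server_times.append(result.time_taken)        # server-side time
    model_times.append(result.model_time_taken)   # DeepSpeed forward-pass time

print(f"avg server time: {sum(server_times) / len(server_times):.3f}s")
print(f"avg model time:  {sum(model_times) / len(model_times):.3f}s")
```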
We are also looking into adding our benchmark code into the MII repository. I will keep you updated on any progress here. Thanks
Thank you for all the pointers!

I tried passing `batch_size` to `generator.query` before; however, it results in this error (for GPT-NeoX-20b):
Exception calling application: Pipeline with tokenizer without pad_token cannot do batching. You can try to set it with `pipe.tokenizer.pad_token_id = model.config.eos_token_id`.
Is this something MII allows setting without modifying the internal pipeline src?
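For reference, what the error message suggests would look like this on a bare `transformers` pipeline (MII constructs that pipeline internally, which is why I'm asking whether it can be set without patching):

```python
from transformers import pipeline

# Illustrative only: a plain transformers pipeline, outside of MII.
pipe = pipeline("text-generation", model="EleutherAI/gpt-neox-20b", device_map="auto")

# GPT-NeoX's tokenizer has no pad token, so batching fails unless one is set.
pipe.tokenizer.pad_token_id = pipe.model.config.eos_token_id

outputs = pipe(["DeepSpeed is", "Seattle is"], batch_size=2, max_new_tokens=32)
```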
EDIT: Setting the batch size for OPT yields:
Exception calling application: The specified pointer resides on host memory and is not registered with any CUDA device
Am I correct in understanding that there is no support for concurrent processing of inference requests? Would multiple model instances have to be loaded on the GPU for that to work?
How can I apply concurrent processing to inference requests? I would be glad to hear your response. Thank you.
Hello,
I'm trying to benchmark inference performance of various LLMs using MII.
I load models using:
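(Something along these lines; the model, deployment name, and config values here are just placeholders:)

```python
import mii

# Placeholder model/deployment; swapped out for each model being benchmarked.
mii.deploy(
    task="text-generation",
    model="EleutherAI/gpt-neox-20b",
    deployment_name="text-gen-deployment",
    mii_config={"dtype": "fp16", "tensor_parallel": 1},
)
```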
And my benchmark script looks like:
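(Again only a sketch; the deployment name, prompt, and token counts are illustrative:)

```python
import time

import mii

generator = mii.mii_query_handle("text-gen-deployment")
prompt = {"query": ["DeepSpeed is"]}  # small, fixed input

latencies = []
for _ in range(20):
    start = time.time()
    generator.query(prompt, do_sample=False, max_new_tokens=128)
    latencies.append(time.time() - start)

print(f"mean latency over {len(latencies)} runs: "
      f"{sum(latencies) / len(latencies):.3f}s")
```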
Note that I'm reusing a small input and generating a fixed number of tokens to ensure consistency. Maybe this is not the best way to go about it; if so, please do let me know!
When running the above, my MII model server displays the following error after a few inferences:
I have two questions:

1. What is causing the error above, and how can I avoid it?
2. GPU utilization appears quite low during these runs (as seen in `nvidia-smi`). What would be good ways to improve it?

Any pointers would be appreciated -- thanks!