vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

High latency when using llama-2-13b-chat-hf on AWS Sagemaker #894

Closed · pri2si17-1997 closed this issue 1 year ago

pri2si17-1997 commented 1 year ago

Hi Team,

I am using meta-llama/Llama-2-13b-chat-hf with tensor_parallel_size=4 on an AWS SageMaker notebook instance (ml.g5.12xlarge), which has 4 NVIDIA A10G GPUs with 23 GB of memory each. With the recent release, text generation is taking much longer.
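For reference, this is roughly how I load and query the model (a minimal sketch of the vLLM offline API; the prompt and sampling parameters shown here are placeholders, not the exact values from my notebook):

```python
from vllm import LLM, SamplingParams

# Load Llama-2-13b-chat across all 4 A10G GPUs via tensor parallelism.
llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    tensor_parallel_size=4,
)

# Placeholder prompt and sampling settings used to time generation.
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain what vLLM does."], sampling_params)
print(outputs[0].outputs[0].text)
```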

Here is the screenshot with the environment:

[screenshot: environment details]

Here is the screenshot with the prompt, timing, and GPU consumption once I instantiated the model:

[screenshot: prompt, timing, and GPU usage]

It took 41.6 seconds, which is far too slow. When I tested earlier, generation finished within seconds.

However, if I try it with tensor_parallel_size=2, the response time drops by about half, to around 20.2 seconds.

[screenshot: timing with tensor_parallel_size=2]

I suspect something is going wrong with tensor parallelization. Please help me fix this issue.

Regards,
Priyanshu Sinha

@WoosukKwon

pri2si17-1997 commented 1 year ago

Closing the issue as it was a CUDA version conflict. Creating a new conda environment fixed it.
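For anyone hitting a similar slowdown, a quick sanity check of the PyTorch/CUDA setup inside the active environment can surface this kind of conflict (a minimal sketch; run it in the conda environment that vLLM is installed into):

```python
import torch

# Print the CUDA toolkit PyTorch was built against and the GPUs it can see.
print("torch version:", torch.__version__)
print("CUDA (compiled against):", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
```

If the compiled CUDA version does not match the toolkit/driver on the instance, recreating the environment with matching versions (as I did here) resolves the latency problem.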