vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

High latency when using llama-2-13b-chat-hf on AWS Sagemaker #894

Closed · pri2si17-1997 closed this issue 1 year ago

pri2si17-1997 commented 1 year ago

Hi Team,

I am using meta-llama/Llama-2-13b-chat-hf with tensor_parallel_size=4 on an AWS SageMaker notebook instance (ml.g5.12xlarge), which has 4 NVIDIA A10G GPUs with 23 GB of memory each. With the recent release, text generation is taking much longer.
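For reference, this is roughly how I load and query the model (a minimal sketch of the vLLM offline API; the prompt and sampling parameters shown here are placeholders, not the exact values from my notebook):

```python
from vllm import LLM, SamplingParams

# Load Llama-2-13b-chat across all 4 A10G GPUs via tensor parallelism.
llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    tensor_parallel_size=4,
)

# Placeholder prompt and sampling settings used to time generation.
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain what vLLM does."], sampling_params)
print(outputs[0].outputs[0].text)
```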

Here is the screenshot with the environment:

[screenshot: environment details]

Here is the screenshot with the prompt, timing, and GPU consumption once I instantiated the model:

[screenshot: prompt, timing, and GPU usage]

It took 41.6 seconds, which is far too slow. When I tested earlier, generation finished within seconds.

However, if I try it with tensor_parallel_size=2, the response time drops by about half, to around 20.2 seconds.

[screenshot: timing with tensor_parallel_size=2]

I suspect something is going wrong with tensor parallelization. Please help me fix this issue.

Regards,
Priyanshu Sinha

@WoosukKwon

pri2si17-1997 commented 1 year ago

Closing the issue as it was a CUDA version conflict. Creating a new conda environment fixed it.
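For anyone hitting a similar slowdown, a quick sanity check of the PyTorch/CUDA setup inside the active environment can surface this kind of conflict (a minimal sketch; run it in the conda environment that vLLM is installed into):

```python
import torch

# Print the CUDA toolkit PyTorch was built against and the GPUs it can see.
print("torch version:", torch.__version__)
print("CUDA (compiled against):", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
```

If the compiled CUDA version does not match the toolkit/driver on the instance, recreating the environment with matching versions (as I did here) resolves the latency problem.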