Hi Team,

I am using meta-llama/Llama-2-13b-chat-hf with tensor_parallel_size=4 on an AWS SageMaker notebook instance (ml.g5.12xlarge), which has 4 NVIDIA A10G GPUs with 23 GB of memory each. Since the recent release, text generation has become much slower.
Here is a screenshot of the environment:

Here is a screenshot of the prompt, the timing, and the GPU consumption after I instantiated the model:
It took 41.6 seconds, which is far too slow; in earlier tests it finished within seconds. However, if I set tensor_parallel_size=2, the response time drops to roughly half, around 20.2 seconds. Something seems to be going wrong with the parallelization. Please help me fix this issue.
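For reference, this is roughly how I measure the latency figures above. The helper below is a minimal, self-contained sketch: `generate_fn` stands in for vLLM's `llm.generate` call, and the prompt is a placeholder, not my actual workload.

```python
import time

def timed_generate(generate_fn, prompts):
    """Run generate_fn on a list of prompts and report wall-clock latency.

    generate_fn is any callable taking a list of prompts and returning
    the generated outputs (e.g. a wrapper around vLLM's llm.generate).
    """
    start = time.perf_counter()
    outputs = generate_fn(prompts)
    elapsed = time.perf_counter() - start
    return outputs, elapsed
```

With the model loaded as `llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", tensor_parallel_size=4)`, I call it as `timed_generate(lambda p: llm.generate(p, sampling_params), [prompt])` and read off the elapsed seconds, which is where the 41.6 s and 20.2 s numbers come from.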
Regards,
Priyanshu Sinha
@WoosukKwon