Before submitting a new issue...
[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Your current environment
Below is my current Docker Compose configuration:
How would you like to use vllm
I'm currently deploying the Qwen2.5-72B model with vLLM on two NVIDIA A800 GPUs for a Retrieval-Augmented Generation (RAG) application. Typical requests carry prompts longer than 6,000 tokens, and with my current configuration the Time to First Token (TTFT) is considerably high, which hurts the responsiveness of the RAG service. I would like to optimize the configuration to reduce the TTFT.
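For reference, the deployment described above would look roughly like the Docker Compose sketch below. This is a hypothetical reconstruction, not my exact configuration; the image tag, model path, port, and flag values are assumptions. `--tensor-parallel-size 2` shards the model across both A800s, and `--enable-chunked-prefill` with a bounded `--max-num-batched-tokens` is the vLLM option most often suggested for long-prompt workloads, since it splits a long prefill into chunks that can be batched alongside decode steps instead of blocking them.

```yaml
# Hypothetical sketch -- image tag, model path, and flag values are
# assumptions, not my actual configuration.
services:
  vllm:
    image: vllm/vllm-openai:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    ports:
      - "8000:8000"
    volumes:
      - /models/Qwen2.5-72B-Instruct:/models/Qwen2.5-72B-Instruct
    command: >
      --model /models/Qwen2.5-72B-Instruct
      --tensor-parallel-size 2
      --max-model-len 8192
      --enable-chunked-prefill
      --max-num-batched-tokens 2048
      --gpu-memory-utilization 0.90
```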
Thank you for your assistance!