vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Optimizing TTFT for Qwen2.5-72B Model Deployment on A800 GPUs for RAG Application #10527

Open zhanghx0905 opened 1 day ago

zhanghx0905 commented 1 day ago

Your current environment

Below is my current Docker Compose configuration:

services:
  vllm:
    image: vllm/vllm-openai:v0.6.4
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['2', '3']
              capabilities: [gpu]
    ipc: host
    command: 
      - "--model"
      - "qwen/Qwen2___5-72B-Instruct-GPTQ-Int4"
      - "--gpu-memory-utilization" 
      - "0.9"
      - "--served-model-name" 
      - "qwen2.5-72b"
      - "--enable-auto-tool-choice"
      - "--tool-call-parser"
      - "hermes"
      - "--tensor-parallel-size"
      - "2"
      - "--enable-prefix-caching"
      - "--multi-step-stream-outputs"
      - "False"

How would you like to use vllm

I'm deploying the Qwen2.5-72B model with vLLM on two NVIDIA A800 GPUs for a Retrieval-Augmented Generation (RAG) application. Typical requests carry prompts longer than 6,000 tokens, and with the configuration above the Time to First Token (TTFT) is considerably high, which hurts the responsiveness of the application.

How should I tune the configuration to reduce TTFT?
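
To quantify TTFT while I experiment, I'm timing the first streamed token against the OpenAI-compatible endpoint. A minimal sketch, assuming the openai Python client (>= 1.0), the server reachable on localhost:8000, and a stand-in prompt in place of a real retrieved context:

import time

from openai import OpenAI

# vLLM's OpenAI-compatible server; adjust host/port to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Stand-in for a ~6000-token RAG prompt (measured in characters, not tokens).
long_context = "lorem ipsum " * 2000

start = time.perf_counter()
stream = client.chat.completions.create(
    model="qwen2.5-72b",  # matches --served-model-name above
    messages=[{"role": "user", "content": long_context + "\nSummarize the context above."}],
    stream=True,
    max_tokens=128,
)
first_token_at = None
for chunk in stream:
    # Some chunks (e.g. role-only deltas) carry no content; skip them.
    if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
        first_token_at = time.perf_counter()
        print(f"TTFT: {first_token_at - start:.2f}s")
print(f"Total: {time.perf_counter() - start:.2f}s")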

Thank you for your assistance!

Playerrrrr commented 16 hours ago

+1

Playerrrrr commented 16 hours ago

Would love the optimal A100 config for this, haha