vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Optimizing TTFT for Qwen2.5-72B Model Deployment on A800 GPUs for RAG Application #10527

Open zhanghx0905 opened 1 day ago

zhanghx0905 commented 1 day ago

Your current environment

Below is my current Docker Compose configuration:

services:
  vllm:
    image: vllm/vllm-openai:v0.6.4
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['2', '3']
              capabilities: [gpu]
    ipc: host
    command: 
      - "--model"
      - "qwen/Qwen2___5-72B-Instruct-GPTQ-Int4"
      - "--gpu-memory-utilization" 
      - "0.9"
      - "--served-model-name" 
      - "qwen2.5-72b"
      - "--enable-auto-tool-choice"
      - "--tool-call-parser"
      - "hermes"
      - "--tensor-parallel-size"
      - "2"
      - "--enable-prefix-caching"
      - "--multi-step-stream-outputs"
      - "False"

How would you like to use vllm

I'm deploying the Qwen2.5-72B model with vLLM on two NVIDIA A800 GPUs for a Retrieval-Augmented Generation (RAG) application. Typical requests carry prompts longer than 6,000 tokens, and with the configuration above the Time to First Token (TTFT) is considerably high, which hurts the responsiveness of the application.

How should I tune the configuration to reduce TTFT?
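
To quantify TTFT while I experiment, I'm timing the first streamed token against the OpenAI-compatible endpoint. A minimal sketch, assuming the openai Python client (>= 1.0), the server reachable on localhost:8000, and a stand-in prompt in place of a real retrieved context:

import time

from openai import OpenAI

# vLLM's OpenAI-compatible server; adjust host/port to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Stand-in for a ~6000-token RAG prompt (measured in characters, not tokens).
long_context = "lorem ipsum " * 2000

start = time.perf_counter()
stream = client.chat.completions.create(
    model="qwen2.5-72b",  # matches --served-model-name above
    messages=[{"role": "user", "content": long_context + "\nSummarize the context above."}],
    stream=True,
    max_tokens=128,
)
first_token_at = None
for chunk in stream:
    # Some chunks (e.g. role-only deltas) carry no content; skip them.
    if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
        first_token_at = time.perf_counter()
        print(f"TTFT: {first_token_at - start:.2f}s")
print(f"Total: {time.perf_counter() - start:.2f}s")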

Thank you for your assistance!

Playerrrrr commented 16 hours ago

+1

Playerrrrr commented 16 hours ago

Would love the optimal A100 config for this, haha