vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Throughput and quality issue with vllm 0.6.0. #8284

Open Agrawalchitranshu opened 2 months ago

Agrawalchitranshu commented 2 months ago

As per the vLLM community, vLLM 0.6.0 is an improved version with 5x throughput. I have installed vllm==0.6.0, but the throughput remains the same as before. The response quality of the output is also degraded in this version. Has anyone faced a similar issue with this version?

cpwan commented 2 months ago
I have done some benchmarking with LLMPerf, using 150 requests of 1000 input tokens and 500 output tokens (CUDA 12.4, NVIDIA driver 550).

| GPU | LLM model | vLLM version | 1 concurrent request (tokens/s) | 10 concurrent requests (tokens/s) |
| --- | --- | --- | --- | --- |
| RTX 4090 | Llama 3.1 8B fp8 | v0.5.3 | 90 | 714 |
| RTX 4090 | Llama 3.1 8B fp8 | v0.6.0 | 87 | 680 |
| RTX 4090 | Llama 3 8B fp16 | v0.4.1 | / | 484 |
| RTX 4090 | Llama 3 8B fp16 | v0.6.0 | / | 488 |

For Llama 3.1 8B, I set --max-model-len 80000.

All vLLM versions are from the official images on Docker Hub.
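
For anyone who wants to run a similar comparison without the full LLMPerf harness, here is a minimal sketch that sends concurrent requests to a vLLM OpenAI-compatible server and reports output tokens per second. The server URL, model name, prompt, and request counts below are placeholder assumptions, not the exact values from the table above.

```python
# Minimal concurrency benchmark against a vLLM OpenAI-compatible server.
# This is NOT LLMPerf; it only illustrates the 1-vs-10 concurrent request
# comparison above. The URL, model name, and prompt are placeholders.
import time
import concurrent.futures

import requests  # pip install requests

API_URL = "http://localhost:8000/v1/completions"    # assumed server address
MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"      # assumed model name
PROMPT = "Summarize the history of GPUs. " * 100     # rough ~1000-token stand-in


def one_request() -> int:
    """Send one completion request and return the number of generated tokens."""
    resp = requests.post(
        API_URL,
        json={"model": MODEL, "prompt": PROMPT, "max_tokens": 500},
        timeout=600,
    )
    resp.raise_for_status()
    # The response follows the OpenAI schema, which reports completion_tokens.
    return resp.json()["usage"]["completion_tokens"]


def run(concurrency: int, total_requests: int = 150) -> float:
    """Return output tokens per second at the given concurrency level."""
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        tokens = sum(pool.map(lambda _: one_request(), range(total_requests)))
    return tokens / (time.perf_counter() - start)


if __name__ == "__main__":
    for c in (1, 10):
        print(f"{c:>2} concurrent requests: {run(c):.1f} output tokens/s")
```

Start the server first (e.g. one of the official Docker images with --max-model-len 80000 as above), then run the script once per vLLM version you want to compare.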

cherishhh commented 2 months ago

Benchmarking is an art, and the percentage improvement in throughput varies with model size and GPU. vLLM 0.6.0 is optimized for high-throughput scenarios, particularly where CPU overhead is significant. You can refer to the benchmarking script provided by SGLang for throughput testing: https://github.com/sgl-project/sglang/blob/main/python/sglang/bench_serving.py

HelenaSak commented 2 months ago

Hello! For the Llama 3.1 70B AWQ 4-bit model on 1x A100, version 0.6.0 is even slightly worse. I ran a comparison using benchmark_throughput.py:

- Version 0.6.0: {'elapsed_time': 305.8413491959218, 'num_requests': 10, 'requests_per_second': 0.03269669070676904}
- Version 0.5.5: {'elapsed_time': 287.37649882701226, 'num_requests': 10, 'requests_per_second': 0.03479755665761504}

That is, there is no improvement for quantized models. The same data was used for all requests: inputs range from 10 to 27k tokens, output = 512, max_model_len = 32000.
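
For reference, a throughput number in that format can also be collected offline with vLLM's Python API, as in the rough sketch below. This is not the official benchmark_throughput.py script; the AWQ checkpoint name and the fixed prompts are assumptions for illustration only.

```python
# Rough offline throughput check in the spirit of benchmarks/benchmark_throughput.py.
# Not the official script: the model repo id and the prompts are placeholders,
# assuming a pre-quantized Llama 3.1 70B AWQ checkpoint is available.
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",  # assumed repo id
    quantization="awq",
    max_model_len=32000,
)

# 10 requests, matching num_requests above; the real inputs ranged from 10 to 27k tokens.
prompts = ["Explain transformer attention in detail."] * 10
params = SamplingParams(max_tokens=512, temperature=0.0)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

print({
    "elapsed_time": elapsed,
    "num_requests": len(outputs),
    "requests_per_second": len(outputs) / elapsed,
})
```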