Agrawalchitranshu opened 2 months ago
I have done some benchmarks with LLMPerf, using 150 requests of 1000 input tokens and 500 output tokens (CUDA 12.4, NVIDIA driver 550).

| GPU Model | LLM model | vLLM version | 1 concurrent request (token/s) | 10 concurrent requests (token/s) |
|---|---|---|---|---|
| RTX 4090 | llama3.1 8B fp8 | v0.5.3 | 90 | 714 |
| RTX 4090 | llama3.1 8B fp8 | v0.6.0 | 87 | 680 |
| RTX 4090 | llama3 8B fp16 | v0.4.1 | / | 484 |
| RTX 4090 | llama3 8B fp16 | v0.6.0 | / | 488 |
For llama3.1 8B, I set `--max-model-len 80000`.
All vLLM versions are the official images from Docker Hub.
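For anyone who wants to reproduce a similar number without the full LLMPerf harness, here is a minimal sketch of the measurement idea; the endpoint URL, model name, and prompt are placeholders rather than my exact configuration:

```python
# Rough client-side token/s measurement against a vLLM OpenAI-compatible server.
# Assumptions: server at localhost:8000 and model "meta-llama/Llama-3.1-8B-Instruct";
# adjust both to your deployment. This is not the exact LLMPerf harness.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
PROMPT = "word " * 1000          # rough stand-in for ~1000 input tokens
NUM_REQUESTS = 150
CONCURRENCY = 10                 # set to 1 for the single-request column

def one_request(_):
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=500,
    )
    return resp.usage.completion_tokens

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    completion_tokens = sum(pool.map(one_request, range(NUM_REQUESTS)))
elapsed = time.time() - start
print(f"output throughput: {completion_tokens / elapsed:.1f} token/s")
```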
Benchmarking is an art, and the percentage improvement in throughput varies with model size and GPU. vLLM 0.6.0 is optimized for high-throughput scenarios, particularly where CPU overhead is significant. For throughput testing, you can refer to the benchmarking script provided by sglang: https://github.com/sgl-project/sglang/blob/main/python/sglang/bench_serving.py
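As a rough sketch, assuming a vLLM OpenAI-compatible server is already running, the script can be driven along these lines; the flag names below are my assumptions about its argparse options and may differ between versions, so check `--help` first:

```python
# Hedged sketch of launching sglang's bench_serving.py against a running vLLM
# server with a random dataset of ~1000-token inputs and 500-token outputs.
# Flag names are assumptions; verify them with `python3 bench_serving.py --help`.
import subprocess

subprocess.run(
    [
        "python3", "bench_serving.py",
        "--backend", "vllm",
        "--host", "127.0.0.1",
        "--port", "8000",
        "--dataset-name", "random",
        "--random-input-len", "1000",
        "--random-output-len", "500",
        "--num-prompts", "150",
    ],
    check=True,
)
```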
Hello! For the Llama 3.1 70B AWQ 4-bit model on 1x A100, version 0.6.0 even became slightly worse. I ran a comparative test with benchmark_throughput.py:

Version 0.6.0: {'elapsed_time': 305.8413491959218, 'num_requests': 10, 'requests_per_second': 0.03269669070676904}
Version 0.5.5: {'elapsed_time': 287.37649882701226, 'num_requests': 10, 'requests_per_second': 0.03479755665761504}

That is, there are no improvements for quantized models. The same requests were used for both versions; inputs range from 10 to 27k tokens, output = 512, and max_model_len=32000.
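For reference, the requests_per_second figure is just num_requests / elapsed_time, so on this workload 0.6.0 comes out roughly 6% slower than 0.5.5:

```python
# Recompute the reported requests/s figures from the elapsed times above.
num_requests = 10
runs = {"v0.6.0": 305.8413491959218, "v0.5.5": 287.37649882701226}
for version, elapsed in runs.items():
    print(version, num_requests / elapsed)          # ~0.0327 vs ~0.0348 req/s
print("relative change:", 287.37649882701226 / 305.8413491959218 - 1)  # about -6%
```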
As per the vLLM community, vLLM 0.6.0 is an improved version with 5x throughput. I have installed vllm==0.6.0, but the throughput remains the same as before. Also, the response quality is degraded in this version. Has anyone faced a similar issue with this version?