vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: Maximizing the performance of batch inference of big models on vllm 0.6.3 #9383

Open · Hellisotherpeople opened this issue 1 month ago

Hellisotherpeople commented 1 month ago

Misc discussion on performance

Hi all, I'm having trouble maximizing the performance of batch inference of big models on vLLM 0.6.3

(Llama 3.1 70b, 405b, Mistral large)

My command to run the server is:

`python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-Large-Instruct-2407 --tensor-parallel-size 4 --guided-decoding-backend lm-format-enforcer --enable-chunked-prefill --enable-prefix-caching`

Specifically, I'm running on 4xA100 80GB hardware

I launch requests with large min_tokens and max_tokens values (30,000 each), and I set n = 8 to get 8 responses generated in parallel.
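
For context, a request of that shape sent to the OpenAI-compatible server looks roughly like the sketch below (the base URL assumes the default port 8000, the prompt is a placeholder, and min_tokens is passed through extra_body since it is a vLLM-specific sampling parameter rather than part of the standard OpenAI API):

```python
from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="mistralai/Mistral-Large-Instruct-2407",
    messages=[{"role": "user", "content": "Write an extremely long report."}],
    n=8,                                 # 8 parallel samples per request
    max_tokens=30_000,                   # very long generations
    extra_body={"min_tokens": 30_000},   # vLLM extension: don't stop early
)

for choice in completion.choices:
    print(len(choice.message.content))
```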

It appears that, despite every request having the same min and max token values, my average generation throughput starts very high (~100+ tok/s) and slowly decays over time to a crawl (I was seeing 4 tok/s before I stopped the generation with Mistral Large). This makes it take a prohibitively long time to get outputs.

I used to have max_tokens set to a very high value but min_tokens set low; the model usually gave short outputs but consistently kept a high tok/s.

I need to get outputs in a reasonable time. Setting n lower cripples my tok/s, this doesn't appear to be a GPU memory issue, and lowering min/max tokens isn't an option (the outputs need to be very long). What configuration/settings changes can I make to optimize my inference environment for this workload?
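
In case it helps frame an answer: the settings that usually govern long batched decodes are the scheduler limits and the KV-cache budget. Below is a hedged sketch of the same workload run through the offline LLM API with those knobs spelled out (the values are illustrative only, not recommendations, and each keyword corresponds to the matching --kebab-case flag of the API server):

```python
from vllm import LLM, SamplingParams

# Sketch only: same model and parallelism as the server command above,
# with the scheduler/KV-cache knobs made explicit.
llm = LLM(
    model="mistralai/Mistral-Large-Instruct-2407",
    tensor_parallel_size=4,
    enable_chunked_prefill=True,
    enable_prefix_caching=True,
    gpu_memory_utilization=0.95,  # fraction of HBM vLLM may use (more KV cache)
    max_num_seqs=16,              # cap on concurrently running sequences
)

# n=8 samples, each forced to 30k tokens, mirroring the reported requests.
params = SamplingParams(n=8, min_tokens=30_000, max_tokens=30_000)

outputs = llm.generate(["Write an extremely long report."], params)
for out in outputs:
    print([len(c.token_ids) for c in out.outputs])
```

One possible explanation for the gradual slowdown is that many 30k-token sequences resident at once can exhaust the KV cache, at which point vLLM starts preempting and recomputing sequences; the knobs above mainly control how quickly that point is reached.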

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

```
Collecting environment information...
Traceback (most recent call last):
  File "/home/lain/collect_env.py", line 743, in <module>
    main()
  File "/home/lain/collect_env.py", line 722, in main
    output = get_pretty_env_info()
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lain/collect_env.py", line 717, in get_pretty_env_info
    return pretty_str(get_env_info())
                      ^^^^^^^^^^^^^^
  File "/home/lain/collect_env.py", line 549, in get_env_info
    vllm_version = get_vllm_version()
                   ^^^^^^^^^^^^^^^^^^
  File "/home/lain/collect_env.py", line 270, in get_vllm_version
    from vllm import __version__, __version_tuple__
ImportError: cannot import name '__version_tuple__' from 'vllm' (/home/lain/micromamba/envs/vllm/lib/python3.11/site-packages/vllm/__init__.py)
```

(seems your script is broken)

vLLM 0.6.3

"python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-Large-Instruct-2407 --tensor-parallel-size 4 --guided-decoding-backend lm-format-enforcer --enable-chunked-prefill --enable-prefix-caching "


sir3mat commented 1 month ago

Have you tested the output with vLLM 0.6.3 on longer inputs, ranging from 8k up to 100k tokens? I tested both 0.6.3 and 0.6.3.post1 using models like LLaMA 3 70B with a 128k context and LLaMA 3.2 (128k context), but both versions produce random tokens as output.

Interestingly, when I run the same tests with vllm 0.6.2, it works as expected.
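
For reproducibility, a minimal long-context sanity check against a running server could look roughly like this (a sketch; the model name, port, and crude prompt padding are assumptions, not taken from the reports above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Pad the prompt well past 8k tokens; the repeated sentence is roughly
# 10 tokens, so 4000 repetitions gives a prompt on the order of 40k tokens.
filler = "The quick brown fox jumps over the lazy dog. " * 4000
prompt = filler + "\n\nSummarize the text above in one sentence."

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder 128k-context model
    messages=[{"role": "user", "content": prompt}],
    max_tokens=200,
)
# Gibberish here would match the behaviour reported for 0.6.3/0.6.3.post1.
print(resp.choices[0].message.content)
```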