vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

critical slowness to reach first token as concurrency grows -- balance fairness vs. throughput? #3096

Open pseudotensor opened 6 months ago

pseudotensor commented 6 months ago

Doing a pytest parallel attack on vLLM with the OpenAI client. Running these on A100s for each model: the 70B and Capybara models are on 4xA100 80GB, Mixtral on 2xA100 80GB.

e.g. for 70b we run:

python -m vllm.entrypoints.openai.api_server \
        --port=5000 \
        --host=0.0.0.0 \
        --model=h2oai/h2ogpt-4096-llama2-70b-chat \
        --tokenizer=hf-internal-testing/llama-tokenizer \
        --tensor-parallel-size=4 \
        --seed 1234 \
        --trust-remote-code \
        --max-num-batched-tokens 8192 \
        --download-dir=/workspace/.cache/huggingface/hub

or for Mistral 7b v0.2:

python -m vllm.entrypoints.openai.api_server \
        --port=5004 \
        --host=0.0.0.0 \
        --model=mistralai/Mistral-7B-Instruct-v0.2 \
        --tensor-parallel-size=1 \
        --seed 1234 \
        --trust-remote-code \
        --max-num-batched-tokens 131072 \
        --download-dir=/workspace/.cache/huggingface/hub

i.e. high batching.

or for Mixtral:

python -m vllm.entrypoints.openai.api_server \
--port=5002 \
--host=0.0.0.0 \
--model mistralai/Mixtral-8x7B-Instruct-v0.1 \
--seed 1234 \
--tensor-parallel-size=2 \
--max-num-batched-tokens=163840 \
--max-log-len=100

also a pretty high batch size.

However, while it's understandable that increased concurrency leads to lower tokens per second, what's most concerning is the time to first token and how many requests are "unlucky", taking as long as 250 seconds to get their first token.

Can vLLM be changed so that we can balance throughput vs. fairness? In general, at high concurrency I think fairness is more critical than total throughput.

[image attachment]

Here are some results. The code is not pretty, but I can share upon request.

vllmstress2_0_1000_0_4096.csv.final.clean.csv

The code is ugly and the prompts are not really right for the model, but you get the idea. I removed the actual IPs:

stress_vllm_github.py.zip

pseudotensor commented 6 months ago

Here's another example for 31744 tokens into Mixtral. The time to first token for some "users" is pretty bad.

[image attachment]

pseudotensor commented 6 months ago

FYI @sh1ng

sh1ng commented 6 months ago

Hi @pseudotensor,

I haven't checked your code yet, but want to add a few small comments first.

I got the following results

python -m vllm.entrypoints.api_server --model h2oai/h2ogpt-4096-llama2-13b-chat --swap-space 32  --disable-log-requests -tp 2 --scheduler-policy reorder --scheduler-reorder-window 0.1
python benchmarks/benchmark_serving.py --dataset ShareGPT_V3_unfiltered_cleaned_split.json --backend vllm --model h2oai/h2ogpt-4096-llama2-13b-chat  --request-rate 12

| --request-rate | 200 | 100 | 50 | 25 | 12 | 12 (https://github.com/vllm-project/vllm/pull/2357, reorder-window=0.1) | 3 |
|---|---|---|---|---|---|---|---|
| Request throughput (requests/s) | 2.3 | 2.41 | 2.38 | 2.39 | 2.40 | 2.69 | 2.28 |
| Input token throughput (tokens/s) | 591 | 598 | 592 | 593 | 596 | 668 | 565 |
| Output token throughput (tokens/s) | 576 | 582 | 576 | 577 | 580 | 651 | 550 |
| Mean TTFT (ms) | 178804 | 174218 | 172227 | 160608 | 137951 | 117489 | 26756 |
| Median TTFT (ms) | 176255 | 171432 | 170848 | 158364 | 136089 | 118266 | 21412 |
| P99 TTFT (ms) | 382078 | 372192 | 366251 | 346015 | 299581 | 256327 | 73796 |
| Mean TPOT (ms) | 4677 | 4531 | 4521 | 4204 | 3624 | 3085 | 793 |
| Median TPOT (ms) | 1020 | 992 | 985 | 897 | 754 | 656 | 201 |
| P99 TPOT (ms) | 37875 | 37806 | 36510 | 34321 | 30423 | 25028 | 6860 |

pseudotensor commented 6 months ago

@sh1ng Can you help me understand those results? How do they compare to what I shared? It seems that for a request rate of 12 or 200 you get roughly the same worst-case TTFT, about 256s to 382s, which is really bad, right?

P99 may not be the best metric, because if requests are really just sequentially backlogged, it's not a statistical issue at all; it's a deterministic issue in which later requests are guaranteed to lag.
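
To illustrate the backlog point with a toy first-come-first-served model (the per-request prefill cost below is an assumed, purely illustrative number, not a measurement from this setup):

```python
# Toy FCFS backlog model (illustrative numbers only): if prefill effectively
# serializes under a deep backlog, the k-th queued request's TTFT grows roughly
# linearly with its queue position -- a deterministic lag, not a statistical tail.
PREFILL_SECONDS_PER_REQUEST = 2.5   # assumed average prefill cost per request
QUEUE_POSITIONS = [1, 10, 50, 100]  # example positions in the backlog

for k in QUEUE_POSITIONS:
    ttft = k * PREFILL_SECONDS_PER_REQUEST
    print(f"request #{k:>3} in the backlog -> first token after ~{ttft:.0f}s")
```

Under this kind of model the worst-case TTFT is set entirely by queue depth, so a percentile mostly just reports how deep the queue got.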

sh1ng commented 6 months ago

@pseudotensor I added request-rate=3, which is the maximum my 2x RTX 3090 can handle.

I see that some of the time the GPUs are underloaded.

$ nvidia-smi 
Thu Mar  7 01:09:53 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:21:00.0 Off |                  N/A |
|100%   84C    P2              278W / 350W|  23929MiB / 24576MiB |     79%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090         On | 00000000:4B:00.0 Off |                  N/A |
|100%   83C    P2              298W / 350W|  23021MiB / 24576MiB |     78%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    119177      C   python                                    23846MiB |
|    1   N/A  N/A    122902      C   ray::RayWorkerVllm.execute_method         22938MiB |
+---------------------------------------------------------------------------------------+

That's concerning.

I agree that from an end-user perspective, TTFT is a very important metric, especially as we move toward working with very long prompts.

OmarSayedMostafa commented 5 months ago

@pseudotensor How do you calculate time to first token and generation tokens per second with vLLM?
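
For reference, a minimal client-side sketch of one way to measure both, using the openai Python client (>= 1.0) in streaming mode against the Mistral 7B server launched earlier in the thread; the base URL, API key, model name, and prompt are placeholders, and this is not the attached stress script:

```python
# Rough client-side measurement of TTFT and generation tokens/s via streaming.
# Assumes openai>=1.0 pointed at a local vLLM OpenAI-compatible server;
# base URL, API key, model name, and prompt are placeholders.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:5004/v1", api_key="EMPTY")

def measure_one(prompt: str, model: str = "mistralai/Mistral-7B-Instruct-v0.2"):
    start = time.perf_counter()
    first_token_time = None
    chunks = 0

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content if chunk.choices else None
        if delta:
            if first_token_time is None:
                first_token_time = time.perf_counter()  # first generated text arrived
            chunks += 1  # roughly one token per streamed chunk

    end = time.perf_counter()
    ttft = first_token_time - start if first_token_time is not None else float("nan")
    # decode rate measured from the first streamed chunk to the last one
    gen_tps = (chunks - 1) / (end - first_token_time) if chunks > 1 else 0.0
    return ttft, gen_tps

if __name__ == "__main__":
    ttft, tps = measure_one("Explain paged attention in two sentences.")
    print(f"TTFT: {ttft:.2f}s  generation: {tps:.1f} tok/s (1 chunk ~= 1 token)")
```

Counting streamed chunks is only an approximation of the token count; re-tokenizing the final text with the model's tokenizer gives a more precise tokens-per-second figure.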

rbgo404 commented 4 months ago

@sh1ng @pseudotensor If my request rate is x, does that mean I am sending x requests together? Does the LLM server receive them as a batch of x requests? If not, how can I send a batched request (asynchronously)?
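
For what it's worth, there is no explicit client-side "batch" call for the chat endpoint; vLLM batches whatever requests are in flight via continuous batching, so sending x requests concurrently is how they end up batched on the server. A minimal async sketch with the openai client (base URL, API key, and model name are placeholders):

```python
# Fire x requests concurrently; the vLLM server groups in-flight requests via
# continuous batching, so no explicit client-side batch API is needed.
# Base URL, API key, and model name are placeholders.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:5004/v1", api_key="EMPTY")

async def one_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.2",
        messages=[{"role": "user", "content": f"Request {i}: say hello."}],
        max_tokens=32,
    )
    return resp.choices[0].message.content or ""

async def main(x: int = 8) -> None:
    # asyncio.gather issues all x requests together; the server batches them
    # alongside anything else it is already generating.
    results = await asyncio.gather(*(one_request(i) for i in range(x)))
    for i, text in enumerate(results):
        print(i, text[:60])

if __name__ == "__main__":
    asyncio.run(main())
```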

AaronFriel commented 4 months ago

@rbgo404 The --scheduler-delay-factor feature is useful for ensuring more requests are processed as a batch, by adding a small delay to scheduling. I'm not sure whether it applies to the very first request received, since the delay is proportional to the latency of previous requests, but even if it's inoperative on the first request, for a long-running server that amortizes out to have no effect.