vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

TGI performance is better than vllm on A800 #262

Closed · jameswu2014 closed 6 months ago

jameswu2014 commented 1 year ago

I use benchmark_serving.py as the client, api_server for vLLM, and text_generation_server for TGI. The client command is listed below: "python benchmark_serving.py --backend tgi/vllm --tokenizer /data/llama --dataset /data/ShareGPT_V3_unfiltered_cleaned_split.json --host 10.3.1.2 --port 8108 --num-prompts 1000"

Why do I get a result where TGI is 2x better than vLLM?

zhuohan123 commented 1 year ago

Hi, can you provide the command you use to start the TGI and vLLM servers? What is the model you are using?

In addition, can you paste the full result logs?

I believe A800 is the same as A100 in terms of single-GPU performance. This result is not expected.

jameswu2014 commented 1 year ago

Hi, the commands are below. TGI: "docker run --gpus device=7 --shm-size 1g -d -p 8108:80 -v /home/wzy/data:/data ghcr.io/huggingface/text-generation-inference:0.8 --model-id /data/llama --num-shard 1 --max-total-tokens 3072 --max-input-length 1024 --max-concurrent-requests 5000 --max-batch-total-tokens 32000"

TGI result:
Total time: 455.54 s
Throughput: 2.20 requests/s
Average latency: 224.83 s
Average latency per token: 0.84 s
Average latency per output token: 5.81 s

vllm: "python -m vllm.entrypoints.api_server --model /data/llama --swap-space 16 --disable-log-requests"

vLLM result:
Total time: 971.95 s
Throughput: 1.03 requests/s
Average latency: 337.35 s
Average latency per token: 0.98 s
Average latency per output token: 5.07 s

By the way, my model is a standard LLaMA-7B model.

yuhai-china commented 1 year ago

I believe your test is not correct; for a LLaMA-7B model, 2.20 requests/s and 1.03 requests/s are both too low.

jameswu2014 commented 1 year ago

Is my server command incorrect, or is it something else? If I modify 'request_rate', the result changes. So how should I set the request_rate?
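(For context: benchmark_serving.py takes a --request-rate argument in requests per second. With the default of inf it fires every prompt at once, so the measured latency is dominated by queueing; a finite rate spaces requests out, roughly as a Poisson arrival process. The example below is only a sketch reusing the paths and host from this thread; check the script's --help for the exact flags in your version.)

```
# Send roughly 2 requests/s instead of all 1000 prompts at once.
# Use the same rate for both the TGI and vLLM runs so the numbers are comparable.
python benchmark_serving.py --backend vllm \
    --tokenizer /data/llama \
    --dataset /data/ShareGPT_V3_unfiltered_cleaned_split.json \
    --host 10.3.1.2 --port 8108 \
    --num-prompts 1000 \
    --request-rate 2.0
```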

jameswu2014 commented 1 year ago

I tested with benchmark_throughput.py; the result is about 111 requests/min, which is correct (close to the declared numbers).
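For reference, benchmark_throughput.py drives the engine directly instead of going through the HTTP server, so it measures raw engine throughput without serving or request-arrival overhead. A typical invocation (paths reused from this thread; flags may differ by vLLM version) looks roughly like:

```
python benchmarks/benchmark_throughput.py \
    --backend vllm \
    --model /data/llama \
    --dataset /data/ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1000
```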

wangzhiwei-ai commented 1 year ago

I used ab to test; the result is about 97 requests/min on a single A100 40G GPU.

zhuohan123 commented 1 year ago

I feel this issue is similar to the results in #275. Can you double-check your tokenizer or test a standard LLaMA model on huggingface like huggyllama/llama-7b?
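For example, a quick sanity check against a known-good checkpoint and tokenizer could look roughly like this (same flags as the commands earlier in this thread; the port is whatever the api_server is started on, 8000 by default):

```
# Server: serve the reference checkpoint instead of the local /data/llama copy.
python -m vllm.entrypoints.api_server --model huggyllama/llama-7b \
    --swap-space 16 --disable-log-requests

# Client: point the benchmark at the same tokenizer so token counts line up.
python benchmark_serving.py --backend vllm \
    --tokenizer huggyllama/llama-7b \
    --dataset /data/ShareGPT_V3_unfiltered_cleaned_split.json \
    --host 10.3.1.2 --port 8000 --num-prompts 1000
```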

lucasjinreal commented 1 year ago

I think this is not a case of TGI being better; rather, vLLM's results are somewhat misaligned with Hugging Face's transformers.

Not sure if it's a bug or a feature, but the outputs clearly do not match the original HF ones. Some questions it is obviously unable to answer that HF can, with the same parameters set up. If vLLM cannot address this issue, moving to TGI would be the safer choice; however, that is actually weird, since TGI uses vLLM inside...

Lvjinhong commented 9 months ago

> I think this is not a case of TGI being better; rather, vLLM's results are somewhat misaligned with Hugging Face's transformers.
>
> Not sure if it's a bug or a feature, but the outputs clearly do not match the original HF ones. Some questions it is obviously unable to answer that HF can, with the same parameters set up. If vLLM cannot address this issue, moving to TGI would be the safer choice; however, that is actually weird, since TGI uses vLLM inside...

Hello, may I ask if this bug has been fixed as of now? I am more inclined to believe that the performance of vLLM is still better than that of TGI, especially in terms of latency.