vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

why online serving slower than offline serving?? #2019

Open BangDaeng opened 7 months ago

BangDaeng commented 7 months ago
  1. Offline serving: [screenshot]

  2. Online serving (FastAPI): [screenshots]

     Log:
     INFO 12-11 21:50:36 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%
     INFO 12-11 21:50:41 async_llm_engine.py:111] Finished request 261ddff3312f44cd8ee1c52a6acd10e6.

Why is generation about 2 seconds slower when served through FastAPI? The parameters are the same and the prompt is the same.

"Open-Orca/Mistral-7B-OpenOrca" this model same issue and any llama2 model same issue

python: 3.10.12
cuda_version: 12.0
gpu: A100 40G
my library list: attached (my library list.txt)
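For reference, here is a minimal sketch of the two setups being compared (sampling parameters and the /generate route are illustrative rather than my exact code, and AsyncLLMEngine signatures vary between vLLM versions). Offline serving:

```python
from vllm import LLM, SamplingParams

# Offline: batch generation through the synchronous LLM entry point.
llm = LLM(model="Open-Orca/Mistral-7B-OpenOrca")
params = SamplingParams(temperature=0.0, max_tokens=256)
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)
```

Online serving via FastAPI + AsyncLLMEngine:

```python
import uuid

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="Open-Orca/Mistral-7B-OpenOrca"))

@app.post("/generate")
async def generate(request: Request) -> JSONResponse:
    body = await request.json()
    params = SamplingParams(temperature=0.0, max_tokens=256)
    request_id = str(uuid.uuid4())
    final = None
    # Consume the async stream; keep only the final accumulated output.
    async for output in engine.generate(body["prompt"], params, request_id):
        final = output
    return JSONResponse({"text": [o.text for o in final.outputs]})
```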

Lvjinhong commented 6 months ago

@irasin Hello, regarding https://github.com/vllm-project/vllm/issues/2257#issuecomment-1869400614: in my latest test, when using AsyncLLMEngine I observed significant fluctuations in GPU utilization (0-100%), but the throughput was high. Previously, when using LLMEngine with bs=1, utilization was stable at 80-90%. What are your thoughts on this?

I am running Llama 2 70B on 8×A800 80G, and in both scenarios the memory usage is approximately 74.72 GB (gpu_memory_utilization=0.9). I'm also curious about the reasons behind such high memory consumption.
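For reference, the engine in my test is constructed roughly like this (the checkpoint name is illustrative, not my exact path):

```python
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# 8-way tensor parallelism across the A800s; gpu_memory_utilization=0.9
# lets vLLM use up to 90% of each GPU for weights plus preallocated KV cache.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="meta-llama/Llama-2-70b-hf",  # illustrative checkpoint
        tensor_parallel_size=8,
        gpu_memory_utilization=0.9,
    ))
```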

SardarArslan commented 5 months ago

Same issue here, online inference is almost half as fast as offline inference.

iamhappytoo commented 3 months ago

Hello @irasin, are there any new thoughts on this issue? I'm encountering the same thing: online speed is about 0.49× the offline batch throughput in tokens/s. Any suggestions would be much appreciated!

rbgo404 commented 2 months ago

> Hello @irasin, are there any new thoughts on this issue? I'm encountering the same thing: online speed is about 0.49× the offline batch throughput in tokens/s. Any suggestions would be much appreciated!

I have observed the same issue.

SamComber commented 2 months ago

+1 have observed this also, currently just living with it.

SardarArslan commented 2 months ago

I think it's slower due to internet latency.


rbgo404 commented 2 months ago

> I think it's slower due to internet latency.

Have you run any benchmarks on this?
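For example, something along these lines would give a like-for-like number. The /generate route and JSON payload follow the sketch in the original post (an assumption, not vLLM's built-in server), and the two halves should be run separately so the model isn't loaded twice on one GPU:

```python
import time

import requests
from vllm import LLM, SamplingParams

PROMPT = "Explain the difference between online and offline serving."
params = SamplingParams(temperature=0.0, max_tokens=256)

# 1) Offline: time one generation through the synchronous LLM API.
llm = LLM(model="Open-Orca/Mistral-7B-OpenOrca")
start = time.perf_counter()
out = llm.generate([PROMPT], params)[0]
offline_s = time.perf_counter() - start
tokens = len(out.outputs[0].token_ids)
print(f"offline: {tokens} tokens in {offline_s:.2f}s ({tokens / offline_s:.1f} tok/s)")

# 2) Online: time the same prompt against the FastAPI endpoint on localhost.
start = time.perf_counter()
resp = requests.post("http://localhost:8000/generate",
                     json={"prompt": PROMPT}, timeout=600)
online_s = time.perf_counter() - start
print(f"online: HTTP round trip took {online_s:.2f}s (vs {offline_s:.2f}s offline)")
```

Sending the request from the same machine as the server would also rule out network latency as the cause.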

xiejibing commented 3 weeks ago

Confused +1