runpod-workers / worker-vllm

The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
MIT License

PSA: Streaming is slow #13

Closed: Palmik closed this issue 8 months ago

Palmik commented 10 months ago

When using STREAMING=True, I got an 8-10x slower response time (time to full response) than when using STREAMING=False. It would be great to investigate this discrepancy and to document the regression in the meantime.
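
For anyone trying to reproduce this, here is a minimal sketch of the comparison, assuming the standard RunPod serverless HTTP endpoints (`/runsync`, `/run`, `/stream/{id}`); `ENDPOINT_ID`, `API_KEY`, and the payload shape are placeholders, not values from this worker:

```python
import time
import requests

# Hypothetical placeholders: substitute your own endpoint ID and API key.
BASE = "https://api.runpod.ai/v2/ENDPOINT_ID"
HEADERS = {"Authorization": "Bearer API_KEY"}
PAYLOAD = {"input": {"prompt": "Hello", "max_tokens": 128}}

def time_to_full_response(streaming: bool) -> float:
    """Wall-clock time until the complete generation is available."""
    start = time.time()
    if not streaming:
        # Non-streaming: a single blocking /runsync call.
        requests.post(f"{BASE}/runsync", json=PAYLOAD, headers=HEADERS).json()
    else:
        # Streaming: submit with /run, then poll /stream until COMPLETED.
        job_id = requests.post(f"{BASE}/run", json=PAYLOAD, headers=HEADERS).json()["id"]
        while True:
            chunk = requests.get(f"{BASE}/stream/{job_id}", headers=HEADERS).json()
            if chunk.get("status") == "COMPLETED":
                break
    return time.time() - start

print(f"non-streaming: {time_to_full_response(False):.1f}s")
print(f"streaming:     {time_to_full_response(True):.1f}s")
```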

alpayariyak commented 8 months ago

The current worker is now 10x faster for streaming; give it a try :)

changun commented 8 months ago

Hi @alpayariyak ,

Thank you for the update!!

For our use case (streaming vLLM's output to the end user in real time with a typing effect), streaming is still quite slow: time to finish increases from 5 seconds to 10 seconds if we stream every token. If we only stream every 2 seconds, the time to finish is better, but still slow-ish, and the end result is not real time.

We wonder if that is a limitation of RunPod's streaming architecture (if so, it would really prevent us from moving from AWS to RunPod Serverless), and whether you have any suggestions.
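
For concreteness, our polling pattern is roughly the sketch below (the endpoint ID, API key, and the exact `/stream` response shape are assumptions for illustration):

```python
import time
import requests

# Placeholders: substitute a real endpoint ID and API key.
BASE = "https://api.runpod.ai/v2/ENDPOINT_ID"
HEADERS = {"Authorization": "Bearer API_KEY"}

def stream_tokens(job_id: str, poll_interval: float = 2.0):
    """Poll /stream on a fixed interval, yielding whatever tokens have
    accumulated since the last poll. A longer interval means fewer HTTP
    round-trips (faster time to finish) but a less real-time typing effect."""
    while True:
        resp = requests.get(f"{BASE}/stream/{job_id}", headers=HEADERS).json()
        # Assumed response shape: a "stream" list of {"output": ...} items.
        for item in resp.get("stream", []):
            yield item["output"]
        if resp.get("status") == "COMPLETED":
            return
        time.sleep(poll_interval)
```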

Thanks!

alpayariyak commented 8 months ago

Hi @changun, thank you for the feedback! We're currently working on fixing a bug where the last streaming response takes 100x longer than the others, even though it only contains the status change from IN_PROGRESS to COMPLETED. To work around this, stop requesting the stream once the finished value in the streaming output is True; that should eliminate the delay and reach the same speed as non-streaming.
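
In other words, something like the rough sketch below (the exact nesting of the `finished` flag in the `/stream` payload may differ, and the endpoint ID and API key are placeholders):

```python
import requests

# Placeholders: substitute a real endpoint ID and API key.
BASE = "https://api.runpod.ai/v2/ENDPOINT_ID"
HEADERS = {"Authorization": "Bearer API_KEY"}

def stream_until_finished(job_id: str):
    """Yield streamed chunks, but stop polling as soon as a chunk reports
    finished=True, rather than waiting for the job status to flip to
    COMPLETED. This avoids the slow final status-only response."""
    while True:
        resp = requests.get(f"{BASE}/stream/{job_id}", headers=HEADERS).json()
        for item in resp.get("stream", []):
            output = item.get("output", {})
            yield output
            if output.get("finished"):  # assumed location of the flag
                return
```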