Closed Palmik closed 8 months ago
Current worker is now 10x faster for streaming, give it a try :)
Hi @alpayariyak ,
Thank you for the update!!
For our use case (streaming vLLM's output to the end user in real-time with typing effect), the streaming is still quite slow: Time to finish increases from 5 second to 10 second if we stream every token. If we only stream every 2 second, the time to finish is faster, but still slow-ish, and the end result is not real time.
We wonder if that is a limitation of Runpod's streaming architecture (if so, it will really prevent us from moving from AWS to Runpod Serverless), and wonder if you have any suggestions.
Thanks!
Hi @changun, thank you for the feedback! We're currently working on fixing a bug where, when streaming, the last response takes ~100x longer and contains only the status change from IN_PROGRESS to COMPLETED. To work around this in the meantime, stop making requests as soon as the `finished` value in the streaming output is `True`; that should eliminate the delay and reach the same speed as non-streaming.
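The workaround above can be sketched as a small polling loop. Note this is an illustrative sketch only: the payload shape (a `stream` list of `{"output": ...}` items plus a `finished` flag) is inferred from this thread, and `fetch_chunk` is a hypothetical stand-in for whatever HTTP call you make against the endpoint's stream route.

```python
def collect_stream(fetch_chunk):
    """Poll fetch_chunk() (one stream request per call) and accumulate
    token text, stopping the moment the payload reports finished=True
    instead of waiting for the final COMPLETED status response."""
    tokens = []
    while True:
        payload = fetch_chunk()
        for item in payload.get("stream", []):
            tokens.append(item.get("output", ""))
        if payload.get("finished"):  # stop here; don't poll for COMPLETED
            break
    return "".join(tokens)


# Example with a fake fetcher simulating two streaming responses:
responses = iter([
    {"stream": [{"output": "Hello"}], "finished": False},
    {"stream": [{"output": ", world"}], "finished": True},
])
print(collect_stream(lambda: next(responses)))
```

In a real client, `fetch_chunk` would wrap the HTTP request to the endpoint's stream route; the key point is simply that the loop exits on `finished` rather than on job status.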
When using STREAMING=True, I got 8-10x slower response time (time to full response) than when using STREAMING=False. It would be great to investigate this discrepancy and to document this regression in the meantime.