runpod-workers / worker-vllm

The RunPod worker template for serving our large language model endpoints. Powered by vLLM.

Slow streaming #76

Closed motorbike158 closed 2 months ago

motorbike158 commented 2 months ago

Streaming is extremely slow. The intended effect is to have it look like it's typing, of course, but instead it's just loading in laggy chunks. A GPU pod works fine; it's only the serverless endpoint that causes this. Unfortunately, until this is better we're forced to use HuggingFace serverless.

alpayariyak commented 2 months ago

Hi, can you expand on this? Were you streaming with OpenAI compatibility or RunPod's streaming feature, and were you running the worker code on a pod or vLLM directly?

For the vLLM worker, I implemented dynamic batching for streaming tokens, which maximizes concurrent throughput while keeping time-to-first-token similar to having no batching at all. It sounds like you would like each streamed batch of tokens to be smaller, for a rate closer to typing, so you can simply set the associated env vars/request params to lower values:

Streaming Batch Size Settings:

| Name | Default | Type/Choices | Description |
| --- | --- | --- | --- |
| `DEFAULT_BATCH_SIZE` | 50 | `int` | Default and maximum batch size for token streaming to reduce HTTP calls. |
| `DEFAULT_MIN_BATCH_SIZE` | 1 | `int` | Batch size for the first request, which will be multiplied by the growth factor every subsequent request. |
| `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` | 3 | `float` | Growth factor for the dynamic batch size. |

The way this works is that the first request has a batch size of `DEFAULT_MIN_BATCH_SIZE`, and each subsequent request has a batch size of `previous_batch_size * DEFAULT_BATCH_SIZE_GROWTH_FACTOR`. This continues until the batch size reaches `DEFAULT_BATCH_SIZE`. E.g. with the default values, the batch sizes will be 1, 3, 9, 27, 50, 50, 50, .... You can also specify this per request, with the inputs `max_batch_size`, `min_batch_size`, and `batch_size_growth_factor`. This has nothing to do with vLLM's internal batching; it only controls the number of tokens sent in each HTTP request from the worker.
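
For concreteness, here is a minimal sketch of that growth rule under the default values (the argument names mirror the env vars/request inputs; this is the rule as described above, not code copied from the worker):

```python
# Sketch of the dynamic streaming batch-size schedule described above.
# Not the worker source: just the documented growth rule, for illustration.

def batch_size_schedule(min_batch_size=1, growth_factor=3.0, max_batch_size=50, steps=8):
    """Yield the number of tokens sent in each successive streamed HTTP response."""
    size = float(min_batch_size)
    for _ in range(steps):
        yield min(int(size), max_batch_size)
        size *= growth_factor

print(list(batch_size_schedule()))  # -> [1, 3, 9, 27, 50, 50, 50, 50]
```

Setting `DEFAULT_BATCH_SIZE` (or `max_batch_size` per request) to a small value gives a smoother, more typewriter-like stream, at the cost of more HTTP calls.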

Note that this will lower your performance if you are planning to serve more than 1 request at a time. Though the "laggy chunks" might seem slower, the throughput in tokens per second is actually much higher. Performance-wise, the ideal solution is to control the speed at which you show the received tokens to the user on the client side, which lets you maximize throughput while achieving a typing effect at whatever speed you want.
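
As an illustration of that last suggestion, here is a hedged client-side sketch. It assumes some `stream_batches()` iterator over the chunks you receive from the endpoint (a placeholder for whatever client you use, not part of the worker API) and replays them at a fixed character rate, so throughput stays high while the user still sees a typing effect:

```python
# Client-side pacing sketch: consume streamed token batches at full speed,
# but reveal them to the user at a fixed "typing" rate.
import time
from queue import Queue
from threading import Thread

def typewriter(stream_batches, chars_per_second=40):
    buffer: Queue = Queue()

    def producer():
        for batch in stream_batches():   # each batch is a chunk of generated text
            buffer.put(batch)
        buffer.put(None)                 # sentinel: generation is finished

    Thread(target=producer, daemon=True).start()

    delay = 1.0 / chars_per_second
    while True:
        chunk = buffer.get()
        if chunk is None:
            break
        for ch in chunk:                 # reveal one character at a time
            print(ch, end="", flush=True)
            time.sleep(delay)
```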