Closed: niklub closed this 8 months ago
You can launch a server within a python script. See the following examples https://github.com/sgl-project/sglang/blob/cd3ccb2ed7aaeaa8f56acd467af9ad8fb482f465/examples/quick_start/srt_example_chat.py#L13-L15
Do they meet your requirements?
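For reference, the linked example boils down to something like the sketch below (based on the quick_start example at that commit; the model path and question here are placeholders, and running it requires a GPU plus the model weights):

```python
# Sketch of launching the SRT runtime inside a Python script, following
# the linked srt_example_chat.py. Model path is a placeholder.
import sglang as sgl

@sgl.function
def chat(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=128))

# Starts the server as background processes within this script.
runtime = sgl.Runtime(model_path="meta-llama/Llama-2-7b-chat-hf")
sgl.set_default_backend(runtime)

state = chat.run(question="What is the capital of France?")
print(state["answer"])

runtime.shutdown()  # explicitly stop the background server processes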
That could definitely be a solution from the code-interfacing standpoint. However, it still launches an additional server in a background process, with limited ability to manage it properly. What would be ideal is a single-process application that runs everything without any client-server interaction. I assume that would also remove some overhead. Is that feasible, or am I misunderstanding the general architecture?
Our architecture does not allow this. We launched multiple processes to parallelize some parts (e.g., tokenization, model forward, detokenization). Therefore, we do not have a single-process runtime. This architecture design is optimized for throughput. It does incur a very small overhead for latency, but it is negligible in almost all our use cases.
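To illustrate the design (this is a toy sketch, not SGLang's actual code): the three stages run in separate processes connected by queues, so tokenization of one request can overlap with the model forward of another.

```python
# Toy three-stage pipeline: tokenize -> forward -> detokenize,
# each stage in its own process, connected by queues.
import multiprocessing as mp

def tokenize(inq, outq):
    for text in iter(inq.get, None):
        outq.put([ord(c) for c in text])   # stand-in for real tokenization
    outq.put(None)

def forward(inq, outq):
    for ids in iter(inq.get, None):
        outq.put([i + 1 for i in ids])     # stand-in for a model forward pass
    outq.put(None)

def detokenize(inq, outq):
    for ids in iter(inq.get, None):
        outq.put("".join(chr(i) for i in ids))
    outq.put(None)

def run_pipeline(texts):
    q0, q1, q2, q3 = (mp.Queue() for _ in range(4))
    procs = [
        mp.Process(target=tokenize, args=(q0, q1)),
        mp.Process(target=forward, args=(q1, q2)),
        mp.Process(target=detokenize, args=(q2, q3)),
    ]
    for p in procs:
        p.start()
    for t in texts:
        q0.put(t)
    q0.put(None)  # sentinel propagates through the stages
    results = list(iter(q3.get, None))
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    print(run_pipeline(["abc"]))  # each char id shifted by one -> "bcd"
```

The price of this design is exactly the process-management burden discussed below: the parent script must start, track, and cleanly shut down the workers.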
The downside is what you said: process management can be challenging in Python. Would it help if we added more APIs to Runtime to make the management easier?
Regarding the vLLM interface, it also needs to launch multiple background processes if tensor parallelism is used. Overall, I believe our design made a reasonable trade-off. If you have better ideas, we can discuss more!
I see, thanks for the clarification! I don't have any ideas yet, but I'll be more than happy to share if any come to mind.
Hello guys, are there any plans to support an offline batch-inference mode with local models, without spinning up an additional server, similar to what vLLM implements? That would be much easier to use. Thanks!