Closed: niklub closed this 8 months ago
You can launch a server within a python script. See the following examples https://github.com/sgl-project/sglang/blob/cd3ccb2ed7aaeaa8f56acd467af9ad8fb482f465/examples/quick_start/srt_example_chat.py#L13-L15
Do they meet your requirements?
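For reference, the linked example boils down to something like the sketch below (based on the quick_start example at that commit; the model path and question here are placeholders, and running it requires a GPU plus the model weights):

```python
# Sketch of launching the SRT runtime inside a Python script, following
# the linked srt_example_chat.py. Model path is a placeholder.
import sglang as sgl

@sgl.function
def chat(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=128))

# Starts the server as background processes within this script.
runtime = sgl.Runtime(model_path="meta-llama/Llama-2-7b-chat-hf")
sgl.set_default_backend(runtime)

state = chat.run(question="What is the capital of France?")
print(state["answer"])

runtime.shutdown()  # explicitly stop the background server processes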
That could definitely be a solution from the code-interfacing standpoint. However, it still launches an additional server in a background process, with limited ability to manage it properly. What would be ideal is a single-process application that runs everything without any client-server interaction. I assume that would also remove some overhead. Is that feasible, or am I misunderstanding the general architecture?
Our architecture does not allow this. We launched multiple processes to parallelize some parts (e.g., tokenization, model forward, detokenization). Therefore, we do not have a single-process runtime. This architecture design is optimized for throughput. It does incur a very small overhead for latency, but it is negligible in almost all our use cases.
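To illustrate the design (this is a toy sketch, not SGLang's actual code): the three stages run in separate processes connected by queues, so tokenization of one request can overlap with the model forward of another.

```python
# Toy three-stage pipeline: tokenize -> forward -> detokenize,
# each stage in its own process, connected by queues.
import multiprocessing as mp

def tokenize(inq, outq):
    for text in iter(inq.get, None):
        outq.put([ord(c) for c in text])   # stand-in for real tokenization
    outq.put(None)

def forward(inq, outq):
    for ids in iter(inq.get, None):
        outq.put([i + 1 for i in ids])     # stand-in for a model forward pass
    outq.put(None)

def detokenize(inq, outq):
    for ids in iter(inq.get, None):
        outq.put("".join(chr(i) for i in ids))
    outq.put(None)

def run_pipeline(texts):
    q0, q1, q2, q3 = (mp.Queue() for _ in range(4))
    procs = [
        mp.Process(target=tokenize, args=(q0, q1)),
        mp.Process(target=forward, args=(q1, q2)),
        mp.Process(target=detokenize, args=(q2, q3)),
    ]
    for p in procs:
        p.start()
    for t in texts:
        q0.put(t)
    q0.put(None)  # sentinel propagates through the stages
    results = list(iter(q3.get, None))
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    print(run_pipeline(["abc"]))  # each char id shifted by one -> "bcd"
```

The price of this design is exactly the process-management burden discussed below: the parent script must start, track, and cleanly shut down the workers.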
The downside is what you said: process management can be challenging in Python. Would it help if we added more APIs to Runtime to make the management easier?
Regarding the vLLM interface, it also needs to launch multiple background processes if tensor parallelism is used. Overall, I believe our design made a reasonable trade-off. If you have better ideas, we can discuss more!
I see, thanks for the clarification! I don't have any ideas yet, but I'll be more than happy to share if any come to mind.
Hello guys, are there any plans to support an offline batch-inference mode with local models, without spinning up an additional server, similar to what vLLM implements? That would be much easier to use. Thanks!