Closed XianglongTan closed 4 months ago
Could you tell me what bugs you find with sglang? Perhaps by submitting an issue to sglang repo?
When combined with guidance, the SGLang backend stops generating in the middle. I am still figuring out the reason.
@XianglongTan Would you mind sharing a reproducible example? Also, could you try the SGLang frontend instead of guidance?
https://github.com/sgl-project/sglang/issues/85 Our team submitted a related issue. When we use guidance, we rely on a stop token/text to decide whether a list of JSON objects has been fully generated. But the response from SGLang does not retain the stop text, so guidance may behave abnormally.
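Since the root cause is that the stop text is stripped from the response, one client-side workaround is to re-append it before handing the output to the downstream parser. A minimal sketch (the helper name is hypothetical, not part of guidance or SGLang):

```python
def restore_stop_text(response: str, stop: str) -> str:
    """Re-append the stop text that the backend stripped, so downstream
    parsers that look for it (e.g. a completeness check on a JSON list)
    behave normally."""
    if not response.endswith(stop):
        return response + stop
    return response

# Example: the backend stopped at "]" and dropped it from the response.
fixed = restore_stop_text('[{"a": 1}, {"b": 2}', "]")
```

This only papers over the symptom on the client; retaining the stop text server-side (as requested in the linked issue) is the proper fix.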
@XianglongTan Are you interested in trying the SGLang built-in interface for your JSON decode task? https://github.com/sgl-project/sglang/blob/9e037c822ccabaf593c0145a9d8377f177e22ff9/benchmark/json_decode_regex/bench_sglang.py#L32-L43
Also, we will soon add a new, more Pythonic interface for JSON decoding based on Pydantic. We have also implemented some additional runtime optimizations to make JSON decoding faster. Stay tuned!
@XianglongTan You are welcome to try this new JSON decoding feature on the main branch. We landed some additional optimizations to make it faster: https://github.com/sgl-project/sglang/tree/main?tab=readme-ov-file#json-decoding
@merrymercy Very nice feature! We will give it a try!
Currently we're not looking at this kind of thing.
The first step would simply be to keep an LRU cache of the prompts so users could skip some prefill when continuing their conversation. However, looking at our internal loads, prefill represents a very low share of the compute, so even implementing an LRU cache is unlikely to yield anything significant (on top of making things like sticky sessions much harder to work with).
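For concreteness, that first step could look roughly like the toy sketch below: an LRU map from prompt prefixes to prefill state, with longest-prefix lookup. All names are hypothetical, and a real implementation would cache KV-cache blocks keyed by token IDs, not opaque strings:

```python
from collections import OrderedDict

class PromptPrefixLRU:
    """Toy LRU cache mapping prompt prefixes to (placeholder) prefill state."""

    def __init__(self, capacity: int = 128):
        self.capacity = capacity
        self._cache = OrderedDict()  # prompt prefix -> cached state

    def put(self, prompt: str, state) -> None:
        self._cache[prompt] = state
        self._cache.move_to_end(prompt)
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used

    def longest_prefix(self, prompt: str):
        """Return (cached_prompt, state) for the longest cached prefix of
        `prompt`, or None if nothing matches."""
        best = None
        for cached in self._cache:
            if prompt.startswith(cached) and (best is None or len(cached) > len(best)):
                best = cached
        if best is None:
            return None
        self._cache.move_to_end(best)  # mark as recently used
        return best, self._cache[best]
```

On a hit, the engine would reuse the cached prefill state and only compute the remaining suffix; the sticky-session concern above comes from needing to route a user's follow-up requests to the replica holding their cached state.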
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Feature request
This framework uses an advanced prefix caching technique to accelerate inference. It brings more than a 30% improvement compared to TGI without speculative decoding. Is it possible to integrate it into TGI? https://github.com/sgl-project/sglang
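SGLang's technique (RadixAttention) keeps KV-cache state in a radix tree keyed by token sequences, so requests sharing a prompt prefix can skip recomputing it. A toy plain-trie sketch of the prefix-matching idea, with hypothetical names and a placeholder where real KV blocks would live:

```python
class TrieNode:
    """One node per token; kv_block is a stand-in for cached KV state."""
    __slots__ = ("children", "kv_block")

    def __init__(self):
        self.children = {}   # token id -> TrieNode
        self.kv_block = None

def insert(root: TrieNode, tokens: list[int], kv_block) -> None:
    """Record that the KV state for this exact token prefix is cached."""
    node = root
    for t in tokens:
        node = node.children.setdefault(t, TrieNode())
    node.kv_block = kv_block

def match_prefix(root: TrieNode, tokens: list[int]) -> int:
    """Return how many leading tokens have cached KV state available."""
    node, matched = root, 0
    for i, t in enumerate(tokens, start=1):
        if t not in node.children:
            break
        node = node.children[t]
        if node.kv_block is not None:
            matched = i
    return matched
```

A production radix tree compresses runs of single-child nodes and handles eviction under memory pressure; this sketch only shows why shared prefixes let the engine skip part of prefill.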
Motivation
Accelerate inference without losing throughput
Your contribution
I can report the bugs I have found in SGLang and possibly how I fixed them.