huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Support SGLang RadixAttention #1466

Closed XianglongTan closed 4 months ago

XianglongTan commented 5 months ago

Feature request

This framework uses an advanced prefix caching technique (RadixAttention) to accelerate inference. It brings more than a 30% improvement compared to TGI without speculative decoding. Would it be possible to integrate it into TGI? https://github.com/sgl-project/sglang

Motivation

Accelerate inference without losing throughput

Your contribution

I can report the bugs I have seen in SGLang and possibly how I fixed them.

merrymercy commented 5 months ago

Could you tell me what bugs you found with sglang? Perhaps by submitting an issue to the sglang repo?

XianglongTan commented 5 months ago

> Could you tell me what bugs you found with sglang? Perhaps by submitting an issue to the sglang repo?

When combined with guidance, the SGLang backend stops generating in the middle of the output. I am still figuring out the reason.

merrymercy commented 5 months ago

@XianglongTan Would you mind sharing a reproducible example? Also, could you try the SGLang frontend instead of guidance?

XianglongTan commented 5 months ago

https://github.com/sgl-project/sglang/issues/85 Our team submitted a related issue. When we use guidance, we rely on a stop token/text to decide whether a list of JSON objects has been fully generated. However, the response from SGLang does not retain the stop string, so guidance may behave abnormally.
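
For context, here is a minimal sketch (hypothetical names, not our production code) of the kind of check that breaks when the backend strips the stop string from the response:

```python
import json

# STOP_TEXT marks the end of a generated JSON list; the real stop text and
# parsing logic in our pipeline differ, this only illustrates the failure mode.
STOP_TEXT = "]"

def parse_json_list(completion: str) -> list:
    # guidance-style logic: treat the output as complete only once the stop
    # text is present. If the backend removes the stop string from the
    # response, this check never passes and generation appears to
    # "stop in the middle".
    if not completion.rstrip().endswith(STOP_TEXT):
        completion = completion + STOP_TEXT  # workaround: re-append the stop text
    return json.loads(completion)
```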

merrymercy commented 5 months ago

@XianglongTan Are you interested in trying the SGLang built-in interface for your JSON decode task? https://github.com/sgl-project/sglang/blob/9e037c822ccabaf593c0145a9d8377f177e22ff9/benchmark/json_decode_regex/bench_sglang.py#L32-L43
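
For readers following along, a rough sketch of that built-in interface, adapted from the linked benchmark (the prompt, field names, and regex here are illustrative, and the exact API may have changed since):

```python
import sglang as sgl

# Illustrative regex constraining the output to a small JSON object; the
# pattern used in the linked benchmark is more elaborate.
person_regex = r'\{\n    "name": "[\w\s]{1,16}",\n    "age": [0-9]{1,3}\n\}'

@sgl.function
def json_decode(s, document):
    s += "Extract the person described below as JSON.\n"
    s += document + "\n"
    s += sgl.gen("json_output", max_tokens=128, regex=person_regex)

# Example usage (assumes an SGLang runtime has been set as the default backend):
# state = json_decode.run(document="Alice is a 30-year-old engineer.")
# print(state["json_output"])
```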

Also, we will soon add a new, more Pythonic interface for JSON decoding based on Pydantic. We have also implemented some additional runtime optimizations to make JSON decoding faster. Stay tuned!

merrymercy commented 5 months ago

@XianglongTan You are welcome to try this new JSON decoding feature on the main branch. We landed some additional optimizations to make it faster: https://github.com/sgl-project/sglang/tree/main?tab=readme-ov-file#json-decoding

XianglongTan commented 5 months ago

@merrymercy Very nice feature! We will give it a try!

Narsil commented 5 months ago

Currently we're not looking at this kind of thing.

The first step would simply be to keep an LRU cache of prompts so users could skip some of the prefill when continuing their conversation. However, looking at our internal loads, prefill represents a very low share of the compute, so it's unlikely that even an LRU cache would yield anything significant (on top of making things like sticky sessions much harder to work with).
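
For concreteness, a minimal sketch of that exact-match prompt LRU idea (a hypothetical class, not part of TGI; a real implementation would cache KV blocks on the GPU and need longest-prefix matching rather than exact prompt equality):

```python
from collections import OrderedDict

class PromptLRUCache:
    """Sketch only: maps a prompt string to an opaque prefill state handle."""

    def __init__(self, capacity: int = 128):
        self.capacity = capacity
        self._entries: "OrderedDict[str, object]" = OrderedDict()

    def get(self, prompt: str):
        """Return the cached prefill state for `prompt`, or None on a miss."""
        state = self._entries.get(prompt)
        if state is not None:
            self._entries.move_to_end(prompt)  # mark as most recently used
        return state

    def put(self, prompt: str, prefill_state) -> None:
        """Store a prefill state, evicting the least recently used entry."""
        self._entries[prompt] = prefill_state
        self._entries.move_to_end(prompt)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)
```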

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.