Looking forward to SGLang supporting this feature! Thanks!
Thanks for the request. We will prioritize it for this week. @BeyonderXX
We do have a reward model implementation: https://github.com/sgl-project/sglang/blob/36078fb24760de837b95f4bea87ede00c0fd91e8/python/sglang/srt/models/llama_classification.py#L30
and an example of how to use it: https://github.com/sgl-project/sglang/blob/main/scripts/deprecated/test_httpserver_classify.py
but the API is very hacky. @Ying1123 will work on a better API this week.
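For anyone landing here, a call against such a classification endpoint might look roughly like the sketch below. It assumes the server exposes a `/classify` route and returns scores in an `embedding` field, as in more recent SGLang reward-model examples; the exact route and response schema used by the deprecated script may differ.

```python
import requests

# Sketch only; assumes an SGLang server launched with a classification /
# reward model, e.g.:
#   python -m sglang.launch_server --model-path <reward-model> --port 30000
url = "http://localhost:30000/classify"

# Score a batch of conversations in one request.
payload = {
    "text": [
        "User: How do I sort a list in Python?\nAssistant: Use sorted(my_list).",
        "User: How do I sort a list in Python?\nAssistant: I don't know.",
    ]
}
responses = requests.post(url, json=payload).json()

# Assumption: the score comes back in an "embedding" field, as in recent
# SGLang reward-model examples; the deprecated script's schema may differ.
scores = [item["embedding"][0] for item in responses]
print(scores)
```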
Thanks for your help, and sorry for my late reply! We deploy our 32B reward model as a single replica using the following script:
https://github.com/sgl-project/sglang/blob/main/scripts/deprecated/test_httpserver_classify.py
However, we did not observe a noticeable efficiency gain compared with using the transformers package directly.
I wonder whether something in our setup is mismatched?
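For reference, a transformers baseline for this kind of comparison looks roughly like the sketch below (the model name is a placeholder; a 32B model additionally needs multi-GPU sharding, here via `device_map="auto"`):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative only; "my-org/my-32b-reward-model" is a placeholder name.
MODEL = "my-org/my-32b-reward-model"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

texts = [
    "User: How do I sort a list?\nAssistant: Use sorted(my_list).",
    "User: How do I sort a list?\nAssistant: I don't know.",
]
# Assumes the tokenizer defines a pad token; Llama-style tokenizers may need
# tokenizer.pad_token = tokenizer.eos_token first.
inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    # One forward pass per batch, no autoregressive decoding -- which is
    # exactly where SGLang's batching optimizations help most.
    rewards = model(**inputs).logits.squeeze(-1)
print(rewards.tolist())
```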
@UbeCc Sorry for the late reply. Let's look into this later.
@UbeCc SGLang mostly accelerates batched decoding, so your results are somewhat expected. However, there are still opportunities for acceleration. A few questions:
- What is your batch size? Do you send a large enough number of requests to the server?
- How long are your prompts? Are their lengths very different?
- Do they have shared prefixes?
I still think that with a large batch or many shared prefixes, SGLang should still provide a speedup, supported by https://github.com/sgl-project/sglang/pull/1525
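To make the batch-size question concrete: the server can only form large batches when many requests are in flight simultaneously. A hedged sketch of such a client, reusing the `/classify` route and payload assumed earlier:

```python
import asyncio
import aiohttp

# Hypothetical load generator; URL and payload shape are assumptions.
URL = "http://localhost:30000/classify"
CONCURRENCY = 1024  # keep enough requests in flight for the server to batch

async def score(session: aiohttp.ClientSession, sem: asyncio.Semaphore, text: str):
    # The semaphore caps in-flight requests without serializing them.
    async with sem:
        async with session.post(URL, json={"text": [text]}) as resp:
            return await resp.json()

async def main(texts):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(score(session, sem, t) for t in texts))

results = asyncio.run(main([f"prompt {i}" for i in range(4096)]))
```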
I'm using coroutines with multiprocessing, and the product of the two is currently 1024 concurrent requests, which seems enough for a server with 8 replicas. Since we do not do precise length control, the prompt lengths vary from roughly 100 to 2048 tokens. They do not share prefixes because they do not follow a common template.
I think the lack of shared prefixes is the reason for the lack of efficiency improvement.
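If we could restructure our prompts to share a common template, SGLang's prefix caching (RadixAttention) would have something to reuse. A hypothetical sketch:

```python
# Hypothetical restructuring: hoist any instructions common to all requests
# into a single fixed prefix so its KV cache can be reused across requests.
SHARED_PREFIX = (
    "You are a reward model. Rate how helpful the assistant's final answer "
    "is to the user.\n\n"
)

def build_prompt(conversation: str) -> str:
    # Every prompt now starts with the same tokens, so the server can cache
    # the prefix computation once and reuse it for all requests.
    return SHARED_PREFIX + conversation

prompts = [build_prompt(c) for c in ("conversation A ...", "conversation B ...")]
```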
Thanks for your reply!
Motivation
Does SGLang support rapid deployment of reward model (RM) services, or convenient custom APIs? It seems that currently there are only chat/completion/embedding APIs. As a newcomer to inference acceleration, I would appreciate any help.
Related resources
Copied from https://github.com/vllm-project/vllm/issues/6620, which describes the same need.