Looking forward to SGLang supporting this feature! Thanks!
Thanks for the request. We will prioritize it for this week. @BeyonderXX
We do have a reward model implementation: https://github.com/sgl-project/sglang/blob/36078fb24760de837b95f4bea87ede00c0fd91e8/python/sglang/srt/models/llama_classification.py#L30
and an example of how to use it: https://github.com/sgl-project/sglang/blob/main/scripts/deprecated/test_httpserver_classify.py
but the API is very hacky. @Ying1123 will work on a better API this week.
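For anyone landing here, a call against such a classification endpoint might look roughly like the sketch below. It assumes the server exposes a `/classify` route and returns scores in an `embedding` field, as in more recent SGLang reward-model examples; the exact route and response schema used by the deprecated script may differ.

```python
import requests

# Sketch only; assumes an SGLang server launched with a classification /
# reward model, e.g.:
#   python -m sglang.launch_server --model-path <reward-model> --port 30000
url = "http://localhost:30000/classify"

# Score a batch of conversations in one request.
payload = {
    "text": [
        "User: How do I sort a list in Python?\nAssistant: Use sorted(my_list).",
        "User: How do I sort a list in Python?\nAssistant: I don't know.",
    ]
}
responses = requests.post(url, json=payload).json()

# Assumption: the score comes back in an "embedding" field, as in recent
# SGLang reward-model examples; the deprecated script's schema may differ.
scores = [item["embedding"][0] for item in responses]
print(scores)
```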
Thanks for your help, and sorry for my late reply! We deploy our 32B reward model as a single replica using the following script:
https://github.com/sgl-project/sglang/blob/main/scripts/deprecated/test_httpserver_classify.py
However, we did not observe a noticeable efficiency gain compared with using the transformers package directly.
I wonder whether something in our setup is mismatched?
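For reference, a transformers baseline for this kind of comparison looks roughly like the sketch below (the model name is a placeholder; a 32B model additionally needs multi-GPU sharding, here via `device_map="auto"`):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative only; "my-org/my-32b-reward-model" is a placeholder name.
MODEL = "my-org/my-32b-reward-model"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

texts = [
    "User: How do I sort a list?\nAssistant: Use sorted(my_list).",
    "User: How do I sort a list?\nAssistant: I don't know.",
]
# Assumes the tokenizer defines a pad token; Llama-style tokenizers may need
# tokenizer.pad_token = tokenizer.eos_token first.
inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    # One forward pass per batch, no autoregressive decoding -- which is
    # exactly where SGLang's batching optimizations help most.
    rewards = model(**inputs).logits.squeeze(-1)
print(rewards.tolist())
```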
@UbeCc Sorry for the late reply. Let's look into this later.
@UbeCc SGLang mostly accelerates batched decoding, so your results are somewhat expected. However, there are still opportunities for acceleration. A few questions:
- What is your batch size? Do you send a large enough number of requests to the server?
- How long are your prompts? Are their lengths very different?
- Do they have shared prefixes?
I still think that with a large batch or many shared prefixes, SGLang should still provide a speedup, supported by https://github.com/sgl-project/sglang/pull/1525
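To make the batch-size question concrete: the server can only form large batches when many requests are in flight simultaneously. A hedged sketch of such a client, reusing the `/classify` route and payload assumed earlier:

```python
import asyncio
import aiohttp

# Hypothetical load generator; URL and payload shape are assumptions.
URL = "http://localhost:30000/classify"
CONCURRENCY = 1024  # keep enough requests in flight for the server to batch

async def score(session: aiohttp.ClientSession, sem: asyncio.Semaphore, text: str):
    # The semaphore caps in-flight requests without serializing them.
    async with sem:
        async with session.post(URL, json={"text": [text]}) as resp:
            return await resp.json()

async def main(texts):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(score(session, sem, t) for t in texts))

results = asyncio.run(main([f"prompt {i}" for i in range(4096)]))
```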
I'm using coroutines with multiprocessing, and the product of the two is currently 1024 concurrent requests, which seems enough for a server with 8 replicas. Since we do not do precise length control, the prompt lengths vary from roughly 100 to 2048 tokens. They do not share prefixes because they do not follow a common template.
I think the lack of shared prefixes is the reason for the lack of efficiency improvement.
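If we could restructure our prompts to share a common template, SGLang's prefix caching (RadixAttention) would have something to reuse. A hypothetical sketch:

```python
# Hypothetical restructuring: hoist any instructions common to all requests
# into a single fixed prefix so its KV cache can be reused across requests.
SHARED_PREFIX = (
    "You are a reward model. Rate how helpful the assistant's final answer "
    "is to the user.\n\n"
)

def build_prompt(conversation: str) -> str:
    # Every prompt now starts with the same tokens, so the server can cache
    # the prefix computation once and reuse it for all requests.
    return SHARED_PREFIX + conversation

prompts = [build_prompt(c) for c in ("conversation A ...", "conversation B ...")]
```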
Thanks for your reply!
Motivation
Does SGLang support rapid deployment of reward model (RM) services, or convenient custom APIs? It seems that currently there are only chat/completion/embedding APIs. As a newcomer to inference acceleration, I would appreciate any help.
Related resources
Copied from https://github.com/vllm-project/vllm/issues/6620, which describes the same need.