AlekseyKorshuk opened this issue 1 month ago
+1!
+1
!!!
Having this feature would be nice, indeed.
Great suggestion. Let's prioritize this one. I can share some ideas and pointers.
Since many parts of the existing code rely on the concept of `input_ids: List[int]`, it is not easy to fully change all of them, as that would create many problematic if/else branches. One possible implementation idea is to create some random fake `input_ids` so that most of the existing code keeps working, and then, during the actual forward pass, feed `input_embeds` instead of calling the embedding layer to encode `input_ids`.
You can learn more about this idea by looking at how the existing Llava implementation directly feeds `input_embeds` into the underlying Llama:
https://github.com/sgl-project/sglang/blob/0736b270202696b8f865e2915aadc36d3d51811b/python/sglang/srt/models/llava.py#L241-L243
https://github.com/sgl-project/sglang/blob/0736b270202696b8f865e2915aadc36d3d51811b/python/sglang/srt/models/llama2.py#L258-L261
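Distilled down, the pattern in those two links looks roughly like the following simplified sketch (class and method signatures are abbreviated here, not copied from the actual SGLang code):

```python
# Simplified sketch of the Llava -> Llama pattern linked above.
import torch.nn as nn

class LlamaModelSketch(nn.Module):
    """Abbreviated stand-in for the linked llama2.py model."""

    def __init__(self, vocab_size=32000, hidden_size=4096):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        # ... the real model also has decoder layers, norms, etc. ...

    def forward(self, input_ids, input_embeds=None):
        if input_embeds is None:
            # Normal path: look up token embeddings from input_ids.
            hidden_states = self.embed_tokens(input_ids)
        else:
            # Llava-style path: the caller already computed the embeddings
            # (e.g. projected image features merged with text embeddings),
            # so the embedding layer is bypassed entirely.
            hidden_states = input_embeds
        # ... decoder layers would consume hidden_states as usual ...
        return hidden_states
```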
The inference of a request starts with `GenerateReqInput` from the HTTP server; it then goes through several important classes: `TokenizerManager`, `ModelTpServer`, `ModelRunner`, `Req`, and `InferBatch`. To implement your change, we need to update these places.
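As a rough sketch of the entry point, `GenerateReqInput` could gain an optional field along these lines (the existing fields shown here are paraphrased and may not match the actual dataclass exactly; only `input_embeds` is the proposed addition):

```python
# Sketch of extending GenerateReqInput with an optional input_embeds field.
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

@dataclass
class GenerateReqInput:
    # The input prompt as text (existing, paraphrased).
    text: Optional[Union[str, List[str]]] = None
    # The input prompt as token ids (existing, paraphrased).
    input_ids: Optional[Union[List[int], List[List[int]]]] = None
    # Proposed addition: precomputed embeddings, [seq_len, hidden_size] per request.
    input_embeds: Optional[List[List[float]]] = None
    # Sampling parameters (existing, paraphrased).
    sampling_params: Optional[Union[Dict, List[Dict]]] = None
```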
1. `TokenizerManager`: https://github.com/sgl-project/sglang/blob/0736b270202696b8f865e2915aadc36d3d51811b/python/sglang/srt/managers/tokenizer_manager.py#L142-L148
2. `Req`: record the `input_embeds`. This may also be a good place to generate the fake `input_ids` mentioned above. https://github.com/sgl-project/sglang/blob/0736b270202696b8f865e2915aadc36d3d51811b/python/sglang/srt/managers/controller/tp_worker.py#L263
3. `InferBatch`; note that in SGLang, "prefill" is also called "extend". https://github.com/sgl-project/sglang/blob/0736b270202696b8f865e2915aadc36d3d51811b/python/sglang/srt/managers/controller/infer_batch.py#L313
4. `ModelRunner`: pass the `input_embeds` to the model. https://github.com/sgl-project/sglang/blob/0736b270202696b8f865e2915aadc36d3d51811b/python/sglang/srt/managers/controller/model_runner.py#L295-L309

This is my rough idea. I haven't implemented it yet, so there may be some mistakes. I hope it is helpful.
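To make the fake `input_ids` idea concrete, here is a hypothetical helper; the function name and its placement are assumptions for illustration, not existing SGLang code:

```python
# Hypothetical helper: fabricate placeholder input_ids so that all
# length-based logic (batching, KV cache allocation, position ids)
# keeps working when the caller supplies input_embeds directly.
from typing import List

import torch

def make_placeholder_input_ids(
    input_embeds: torch.Tensor,  # shape: [seq_len, hidden_size]
    pad_token_id: int = 0,
) -> List[int]:
    # One fake token id per embedding row; the actual values are never
    # embedded, since the forward pass consumes input_embeds instead.
    return [pad_token_id] * input_embeds.shape[0]
```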
@AlekseyKorshuk any updates?
Last week was quite busy for me, so unfortunately I have not started yet.
Motivation
I propose to add `input_embeds` as an optional input to the generation params.

Why is this important
Nowadays there are a lot of Vision-Language Models (VLMs), and they all share a similar architecture: vision tower, projector, LLM. This means the vision tower plus projector just prepares embeddings for the "image" tokens. So why not allow model developers to handle the preparation of `input_embeds` for the LLM themselves? Many new models, like PaliGemma and Florence, let the user work with bounding boxes and segmentation masks, which makes it quite complicated to add all the different processors and conversation templates to the codebase. By allowing the user to provide `input_embeds` instead of a list of messages or a text prompt, you reduce your own headache in the future. Another point is that VLM developers can focus on caching image embeddings while building on top of SGLang, enabling even higher throughput.

vLLM users requested this feature a long time ago, and the topic gained a lot of positive attention from the community:
This unique feature would make SGLang the main framework for all VLMs.
I am happy to help implement this if you can point me to the right places in the codebase. Thank you for your time and consideration đŸ¤—
Proposed usages
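For example, a request could look roughly like the sketch below, assuming `input_embeds` is added to the `/generate` endpoint next to the existing `text`/`input_ids` fields. Since the field is exactly what this issue proposes, this is an illustration of the intended API shape, not working code:

```python
# Hypothetical usage once input_embeds is supported (the field does not exist yet).
import requests
import torch

# Stand-in for real vision tower + projector output:
# one embedding per position, shape [seq_len, hidden_size].
input_embeds = torch.randn(16, 4096)

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "input_embeds": input_embeds.tolist(),  # proposed new field
        "sampling_params": {"max_new_tokens": 64, "temperature": 0.7},
    },
)
print(response.json())
```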
Related resources