Motivation
Currently, SGLang supports getting generated content (chat completions) from generative models and embeddings from embedding models. But in principle, either kind of model can produce both embeddings and generations.
It should be stressed that even though we can do this, it is not useful in practice.
The key differences between generation and embedding models primarily stem from their post-training specialization, leading to a loss of some capabilities, akin to catastrophic forgetting. Embedding models focus on compressing information into a fixed-dimensional vector space, discouraging long-term predictions, while generation models aim to reduce uncertainty in the probability space, addressing both compression of current information and future uncertainties.
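As a toy illustration of this difference (hypothetical shapes and a NumPy stand-in, not SGLang internals): both heads can sit on the same backbone, but an embedding head pools the hidden states into a single fixed-dimensional vector, while a generation head maps the last hidden state to a distribution over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical backbone output: hidden states for a 5-token sequence,
# each a vector of dimension 8.
hidden_states = rng.standard_normal((5, 8))

# Embedding head: mean-pool over the sequence into ONE fixed-size vector,
# compressing the whole input into a point in vector space.
embedding = hidden_states.mean(axis=0)   # shape (8,)

# Generation head: project only the LAST hidden state onto a toy
# 100-token vocabulary to get next-token logits.
vocab_proj = rng.standard_normal((8, 100))
logits = hidden_states[-1] @ vocab_proj  # shape (100,)

# Softmax turns the logits into a next-token probability distribution,
# i.e. a statement about future uncertainty rather than a fixed summary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

This is only a sketch of why serving both outputs from one model is mechanically easy (the backbone is shared) while post-training specialization still makes one of the two heads weak in practice.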
These tasks parallel the distinction between non-autoregressive and autoregressive models: an embedding model would arguably be decoded with methods like MCMC rather than token-by-token approaches.
The community tends to treat generation and embedding as separate tasks, each with its own specialized models and research focus. While the idea of a single model that handles both tasks is attractive, practical challenges make it difficult to realize. OpenAI's recommendation to fine-tune models for specific applications also feels overly product-oriented and not aligned with the concept of AGI.
Related resources
https://github.com/sgl-project/sglang/pull/1186