Motivation
Currently, SGLang supports getting generated content (chat completions) from generative models and embeddings from embedding models. But in principle, either kind of model can produce both embeddings and generations.
It should be stressed that even though we can do this, it is not useful in practice.
The key differences between generation and embedding models primarily stem from their post-training specialization, leading to a loss of some capabilities, akin to catastrophic forgetting. Embedding models focus on compressing information into a fixed-dimensional vector space, discouraging long-term predictions, while generation models aim to reduce uncertainty in the probability space, addressing both compression of current information and future uncertainties.
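As a toy illustration of this difference (hypothetical shapes and a NumPy stand-in, not SGLang internals): both heads can sit on the same backbone, but an embedding head pools the hidden states into a single fixed-dimensional vector, while a generation head maps the last hidden state to a distribution over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical backbone output: hidden states for a 5-token sequence,
# each a vector of dimension 8.
hidden_states = rng.standard_normal((5, 8))

# Embedding head: mean-pool over the sequence into ONE fixed-size vector,
# compressing the whole input into a point in vector space.
embedding = hidden_states.mean(axis=0)   # shape (8,)

# Generation head: project only the LAST hidden state onto a toy
# 100-token vocabulary to get next-token logits.
vocab_proj = rng.standard_normal((8, 100))
logits = hidden_states[-1] @ vocab_proj  # shape (100,)

# Softmax turns the logits into a next-token probability distribution,
# i.e. a statement about future uncertainty rather than a fixed summary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

This is only a sketch of why serving both outputs from one model is mechanically easy (the backbone is shared) while post-training specialization still makes one of the two heads weak in practice.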
These tasks parallel the distinction between non-autoregressive and autoregressive models: an embedding model would arguably be decoded with methods like MCMC rather than token-by-token approaches.
The community tends to treat generation and embedding as separate tasks, each with its own specialized models and research focus. While the idea of a single model that handles both tasks is attractive, practical challenges make it difficult to realize. OpenAI's recommendation to fine-tune models for specific applications also feels overly product-oriented and not aligned with the concept of AGI.
Related resources
https://github.com/sgl-project/sglang/pull/1186