sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0
5.98k stars, 496 forks

[Feature] support min_p sampling #1071

Closed 81549361 closed 2 months ago

81549361 commented 3 months ago

Motivation

The min_p sampling parameter is becoming quite popular. It is conceptually simple and intuitive, and, at least anecdotally (according to many model fine-tuners and users in the LocalLlama community), it tends to perform better than the usual top_p + top_k approach. The READMEs of many new fine-tuned and merged models on Hugging Face recommend using min_p instead of top_p and top_k.

Some of the code has been implemented in flashinfer. https://github.com/flashinfer-ai/flashinfer/pull/422

Related resources

vLLM: https://github.com/vllm-project/vllm/blob/8ea5e44a435e8731fd6f5ba4c329dd112752532a/vllm/sampling_params.py#L64C9-L66C57

min_p: Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.

For example, a min_p of 0.07 means that a token whose probability is less than 7% of the highest-probability token's probability is disqualified. A min_p of 0.5 means that a token must have at least half the probability of the highest-probability token, or it is disqualified. Said another way, min_p sets a minimum fraction of the most likely token's probability; tokens below that fraction cannot be sampled.
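To make the rule above concrete, here is a minimal pure-Python sketch of min_p filtering over a list of token probabilities. The function name `min_p_filter` and the example distribution are made up for illustration; this is not the vLLM or FlashInfer code, only the thresholding rule described in the docstring quoted above, followed by renormalization.

```python
def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p * max(probs), then renormalize.

    Hypothetical helper for illustration; assumes `probs` is a valid
    probability distribution (non-negative, at least one positive entry).
    """
    if not 0.0 <= min_p <= 1.0:
        raise ValueError("min_p must be in [0, 1]")
    # The cutoff scales with the most likely token's probability.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    # The most likely token always survives, so the sum is positive.
    total = sum(kept)
    return [p / total for p in kept]


probs = [0.50, 0.30, 0.15, 0.04, 0.01]
loose = min_p_filter(probs, 0.07)   # cutoff 0.035: only the last token is dropped
strict = min_p_filter(probs, 0.5)   # cutoff 0.25: only the first two tokens remain
```

With min_p set to 0 the cutoff is 0, so every token is kept and the filter is effectively disabled.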

https://github.com/vllm-project/vllm/pull/1642
https://github.com/oobabooga/text-generation-webui/pull/4449
https://github.com/ggerganov/llama.cpp/pull/3841

Please see the above links for more info.


Related resources

No response

zhyncs commented 3 months ago

Yes, the corresponding kernel has already been implemented in FlashInfer. I think it shouldn't be too difficult to integrate into SGLang. Are you interested in submitting a PR? We highly welcome contributions!

81549361 commented 3 months ago

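Yes, the corresponding kernel has already been implemented in FlashInfer. I think integrating it into SGLang shouldn't be too difficult. Are you interested in submitting a PR? We very much welcome contributions!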

I'd love to help, but I'm a newbie. I only know how to add min_p sampling on its own; I don't know how to combine it with top_k and top_p at the same time. https://github.com/81549361/sglang/commit/79e8e8d7ee003a5e8bf8089c1414a9d36aa176d9
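For readers with the same question, one possible way to combine the three filters is sketched below in pure Python. This is an assumption-laden illustration, not what the linked commit, FlashInfer, or #1167 does: the function name `filter_top_k_top_p_min_p` is hypothetical, all three cutoffs are evaluated against the original (unfiltered) distribution, and a token is kept only if it passes all of them. Other implementations may apply the filters sequentially and renormalize in between, which can give different results.

```python
import random


def filter_top_k_top_p_min_p(probs, top_k, top_p, min_p):
    """Keep tokens that pass top_k, top_p and min_p together, then renormalize.

    Hypothetical helper; assumes top_k >= 1, 0 < top_p <= 1, 0 <= min_p <= 1,
    and that `probs` is a valid probability distribution.
    """
    # Token indices sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    threshold = min_p * probs[order[0]]
    keep = set()
    cumulative = 0.0
    for rank, i in enumerate(order):
        if rank >= top_k:            # top_k: at most k tokens
            break
        if cumulative >= top_p:      # top_p: stop once the kept mass reaches top_p
            break
        if probs[i] < threshold:     # min_p: relative to the top token's probability
            break
        keep.add(i)
        cumulative += probs[i]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


probs = [0.50, 0.30, 0.15, 0.04, 0.01]
filtered = filter_top_k_top_p_min_p(probs, top_k=4, top_p=0.9, min_p=0.1)
# Draw one token index from the filtered distribution.
token = random.choices(range(len(filtered)), weights=filtered, k=1)[0]
```

Because each condition only gets stricter further down the sorted list, the loop can stop at the first token that fails any of them. In the example, top_p is the binding constraint: the first three tokens already cover 0.95 of the mass, so the remaining two are dropped.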

merrymercy commented 2 months ago

closed by #1167