vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Support for Controlled Decoding #9541


simonucl commented 6 days ago

🚀 The feature, motivation and pitch

Contrastive Decoding (Li et al., 2022) is a decoding strategy that contrasts the log probabilities of two or more models at each token, shifting the token distribution toward higher-quality or less harmful outputs (Liu et al., 2021). Similar ideas appear in proxy-tuning (Liu et al., 2024), emulated fine-tuning of aligned models (Mitchell et al., 2023), improving reasoning (O'Brien et al., 2023), and test-time alignment (Zhu et al., 2024). The approach also supports the growing interest in test-time alignment (Xu et al., 2024), where a token-level reward model produces partial rewards at each decoding step to guide generation.
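For reference, a minimal sketch of the per-token scoring rule described above, roughly following Li et al. (2022). It assumes `expert_logits` and `amateur_logits` are next-token logits already obtained from two models; the function name and the `alpha` plausibility cutoff are illustrative and not part of any vLLM API.

```python
import torch


def contrastive_next_token(expert_logits: torch.Tensor,
                           amateur_logits: torch.Tensor,
                           alpha: float = 0.1) -> int:
    """Pick the next token by contrasting expert and amateur log probabilities."""
    expert_logprobs = torch.log_softmax(expert_logits, dim=-1)
    amateur_logprobs = torch.log_softmax(amateur_logits, dim=-1)

    # Adaptive plausibility constraint (Li et al., 2022): keep only tokens whose
    # expert probability is at least alpha * the expert's max probability.
    cutoff = expert_logprobs.max() + torch.log(torch.tensor(alpha))
    mask = expert_logprobs >= cutoff

    # Contrastive score: expert log-prob minus amateur log-prob, restricted to
    # the plausible set; everything else is masked out.
    scores = (expert_logprobs - amateur_logprobs).masked_fill(~mask, float("-inf"))
    return int(torch.argmax(scores).item())
```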

Contributions are welcome!

I am currently working on the implementation, and any contributions would be highly appreciated. The initial idea is similar to the speculative decoding setup under spec_decode/, where two or more models are loaded onto the GPU and run inference at each timestep. More details will be shared soon!
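To make the intended control flow concrete, here is a rough sketch of a greedy decoding loop that queries both models at every timestep and combines their logits with the scoring rule sketched earlier. `expert` and `amateur` are assumed to be callables mapping a token-id sequence to next-token logits; this is only an illustration of the idea and does not reflect the actual spec_decode/ interfaces.

```python
import torch


@torch.no_grad()
def contrastive_generate(expert, amateur, prompt_ids,
                         max_new_tokens: int = 32, alpha: float = 0.1):
    """Greedy contrastive decoding: both models run once per generated token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        expert_logits = expert(ids)    # [vocab_size] next-token logits
        amateur_logits = amateur(ids)  # [vocab_size] next-token logits
        # contrastive_next_token is the scoring helper sketched above.
        ids.append(contrastive_next_token(expert_logits, amateur_logits, alpha))
    return ids
```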

Reference

Alternatives

No response

Additional context

No response

Before submitting a new issue...

simonucl commented 4 days ago

A minimal working implementation is done!

The development branch is at https://github.com/simonucl/vllm/tree/contrastive-decoding, with a runnable example under tests/contrast_decode/run.py. It's still a WIP, and any feedback would be appreciated! Also, feel free to request any functionality that fits your needs.