vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Support for Controlled Decoding #9541


simonucl commented 6 days ago

🚀 The feature, motivation and pitch

Contrastive Decoding (Li et al., 2022) is a decoding strategy that contrasts the log probabilities of two or more models at each token, shifting the token distribution toward higher-quality or less harmful outputs (Liu et al., 2021). Similar ideas appear in proxy-tuning (Liu et al., 2024), emulated fine-tuning of aligned models (Mitchell et al., 2023), improving reasoning (O'Brien et al., 2023), and test-time alignment (Zhu et al., 2024). The approach also supports the growing interest in test-time alignment (Xu et al., 2024), where a token-level reward model produces partial rewards at each decoding step to guide generation.
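For reference, a minimal sketch of the per-token scoring rule described above, roughly following Li et al. (2022). It assumes `expert_logits` and `amateur_logits` are next-token logits already obtained from two models; the function name and the `alpha` plausibility cutoff are illustrative and not part of any vLLM API.

```python
import torch


def contrastive_next_token(expert_logits: torch.Tensor,
                           amateur_logits: torch.Tensor,
                           alpha: float = 0.1) -> int:
    """Pick the next token by contrasting expert and amateur log probabilities."""
    expert_logprobs = torch.log_softmax(expert_logits, dim=-1)
    amateur_logprobs = torch.log_softmax(amateur_logits, dim=-1)

    # Adaptive plausibility constraint (Li et al., 2022): keep only tokens whose
    # expert probability is at least alpha * the expert's max probability.
    cutoff = expert_logprobs.max() + torch.log(torch.tensor(alpha))
    mask = expert_logprobs >= cutoff

    # Contrastive score: expert log-prob minus amateur log-prob, restricted to
    # the plausible set; everything else is masked out.
    scores = (expert_logprobs - amateur_logprobs).masked_fill(~mask, float("-inf"))
    return int(torch.argmax(scores).item())
```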

Contributions are welcome!

I am currently working on the implementation, and any contributions would be highly appreciated. The initial idea is similar to the speculative decoding setup under spec_decode/, where two or more models are loaded onto the GPU and run inference at each timestep. More details will be shared soon!
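To make the intended control flow concrete, here is a rough sketch of a greedy decoding loop that queries both models at every timestep and combines their logits with the scoring rule sketched earlier. `expert` and `amateur` are assumed to be callables mapping a token-id sequence to next-token logits; this is only an illustration of the idea and does not reflect the actual spec_decode/ interfaces.

```python
import torch


@torch.no_grad()
def contrastive_generate(expert, amateur, prompt_ids,
                         max_new_tokens: int = 32, alpha: float = 0.1):
    """Greedy contrastive decoding: both models run once per generated token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        expert_logits = expert(ids)    # [vocab_size] next-token logits
        amateur_logits = amateur(ids)  # [vocab_size] next-token logits
        # contrastive_next_token is the scoring helper sketched above.
        ids.append(contrastive_next_token(expert_logits, amateur_logits, alpha))
    return ids
```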

Reference

Alternatives

No response

Additional context

No response

Before submitting a new issue...

simonucl commented 4 days ago

A minimal working implementation is done!

The development branch is at https://github.com/simonucl/vllm/tree/contrastive-decoding, with a runnable example under tests/contrast_decode/run.py. It's still a WIP, and any feedback would be appreciated! Also, feel free to request any functionality that fits your needs.