Open zhenlan0426 opened 6 months ago
I can work on this if someone can point me in the right direction and let me know what part of the code base I should look at.
Hi @zhenlan0426, have you made any progress on this? I also need the hidden_states returned as part of the output of the generate() method.
It seems that hidden states are stored in the inference code: https://github.com/vllm-project/vllm/blob/c2462129521a64b62ace77b28641d2e3bec5831c/vllm/worker/model_runner.py#L774C17-L774C37
🚀 The feature, motivation and pitch
I am generating multiple samples from the same prompt, as in self-consistency Chain of Thought (CoT). I have trained a separate evaluation head (using the same backbone as the LLM generator) to assess the quality of each sample. Without an option to return hidden states, I need to perform an additional forward pass to obtain them, even though most of the computation is already done when vLLM generates the samples. Having this option would significantly reduce inference time. If I were to implement this, could someone point me in the right direction regarding which parts of the source code I should look at? Any pointers would be appreciated. Thanks!
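To make the use case concrete, here is a minimal sketch of the scoring step, assuming the final-token hidden state of each sample were available alongside the generated text. Nothing in it is vLLM API: `score_sample`, `pick_best`, and the linear form of the evaluation head are hypothetical names and assumptions made up for illustration, and the hidden states are toy vectors rather than real model outputs.

```python
from typing import List, Sequence, Tuple


def score_sample(
    hidden_state: Sequence[float],
    head_weights: Sequence[float],
    head_bias: float = 0.0,
) -> float:
    """Hypothetical linear evaluation head: dot(hidden_state, weights) + bias."""
    if len(hidden_state) != len(head_weights):
        raise ValueError("hidden state and head weights must have the same size")
    return sum(h * w for h, w in zip(hidden_state, head_weights)) + head_bias


def pick_best(
    samples: Sequence[str],
    hidden_states: Sequence[Sequence[float]],
    head_weights: Sequence[float],
    head_bias: float = 0.0,
) -> Tuple[str, List[float]]:
    """Score every sample from its hidden state and return the top one with all scores."""
    if not samples:
        raise ValueError("need at least one sample")
    if len(samples) != len(hidden_states):
        raise ValueError("need exactly one hidden state per sample")
    scores = [score_sample(h, head_weights, head_bias) for h in hidden_states]
    best_index = max(range(len(samples)), key=scores.__getitem__)
    return samples[best_index], scores


# Toy data standing in for what generation would return (assumed, not real output).
samples = ["answer A", "answer B", "answer C"]
hidden_states = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
head_weights = [0.5, 1.0]

best, scores = pick_best(samples, hidden_states, head_weights)
print(best, scores)
```

With hidden states returned by generation, this scoring would be the only extra work per sample, instead of a second forward pass through the backbone.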
Alternatives
No response
Additional context
No response