[Usage]: Is there an option to obtain attention matrices during inference, similar to the output_attentions=True parameter in the transformers package? #7736
Your current environment
How would you like to use vllm
Feature Request: Access to Attention Matrices and/or KV-Cache during Inference

I'm wondering if there's a way to obtain attention matrices or access the KV-Cache during inference with vLLM, similar to how the transformers package allows this with the output_attentions=True parameter or through the past_key_values attribute.
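For reference, this is the transformers behaviour the question is comparing against (a minimal sketch; the model name and prompt are arbitrary examples, not anything specific to vLLM):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hello, world", return_tensors="pt")

# output_attentions=True returns the per-layer attention weights,
# use_cache=True returns the per-layer key/value cache.
outputs = model(**inputs, output_attentions=True, use_cache=True)

# outputs.attentions: tuple (one entry per layer) of tensors with shape
#   (batch_size, num_heads, seq_len, seq_len)
# outputs.past_key_values: cached key/value tensors for each layer
print(len(outputs.attentions), outputs.attentions[0].shape)
```

The question is whether vLLM exposes an equivalent hook for inspecting these tensors at inference time.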