vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Is there an option to obtain attention matrices during inference, similar to the output_attentions=True parameter in the transformers package? #7736

Open yuhkalhic opened 4 weeks ago

yuhkalhic commented 4 weeks ago

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

Feature Request: Access to attention matrices and/or the KV cache during inference

I'm wondering if there's a way to obtain attention matrices or access the KV cache during inference with vLLM, similar to how the transformers package allows this via the output_attentions=True parameter or the past_key_values attribute.
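For reference, the transformers behaviour I mean looks roughly like the sketch below (this is the Hugging Face transformers API, not a vLLM feature; the model name and prompt are placeholders chosen only for illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small model used only as an example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Hello, vLLM!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True, use_cache=True)

# Tuple with one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len)
attentions = outputs.attentions

# KV cache: one (key, value) pair per layer
past_key_values = outputs.past_key_values

print(len(attentions), attentions[0].shape)
```

As far as I can tell, vLLM's `LLM.generate` / OpenAI-compatible server does not expose an equivalent option, hence this request.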

SpaceHunterInf commented 1 week ago

Upvote for this request; I'd like to visualise the attention matrix.
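For what it's worth, once the attention matrices are available (e.g. the `attentions`, `inputs`, and `tokenizer` objects from the transformers sketch above), a minimal visualisation could look like this:

```python
import matplotlib.pyplot as plt

layer, head = 0, 0  # which layer/head to plot
attn = attentions[layer][0, head].detach().numpy()  # (seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

fig, ax = plt.subplots()
im = ax.imshow(attn, cmap="viridis")
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=90)
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()
```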