microsoft / MInference

[NeurIPS'24 Spotlight] MInference speeds up long-context LLM inference by computing attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License

[Question]: analysis of attention scores (too sparse) #82

Open wiluen opened 1 month ago

wiluen commented 1 month ago

Describe the issue

I want to ask a general question. When analyzing attention scores, I find that mine are quite sparse and their values are very low, so I cannot extract any useful signal, such as which kinds of tokens receive more attention. Given that a model has n layers and m attention heads, how can I gain some useful insights? My task is to extract important information from the input I provide.
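
For context, this is roughly how I collect the scores (a minimal sketch with a Hugging Face model; `gpt2` here is only a placeholder for the model I actually use):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(name)
# "eager" attention so that attention weights are actually materialized
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager")

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: tuple of n_layers tensors, each (batch, n_heads, seq, seq)
for layer, attn in enumerate(out.attentions):
    probs = attn.clamp_min(1e-9)
    # Low entropy = the head concentrates its mass on a few tokens.
    entropy = -(probs * probs.log()).sum(-1).mean(dim=(-1, 0))  # (n_heads,)
    top_keys = attn.mean(dim=(0, 1, 2)).topk(3).indices  # most-attended positions
    print(f"layer {layer}: head entropies {entropy.tolist()}, "
          f"top attended positions {top_keys.tolist()}")
```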

wiluen commented 1 month ago

Do different attention heads / different layers matter?

iofu728 commented 1 month ago

Hi @wiluen, thanks for your question.

If I understand correctly, you're asking how to determine which parts of the attention weights are more important to preserve, especially in highly sparse scenarios.

  1. In MInference, we don't perform fine-grained per-head adjustments; most heads use the same kernel sparsity rate. However, we replace block sparsity with a higher-budget vertical-slash (VS) pattern for certain heads, since we found that allocating more budget to these heads significantly improves performance (a toy sketch of such a mask follows this list).

  2. There are several related works exploring this direction, including:

    • KV cache compression: PyramidKV, RetrievalAttention
    • Sparse Attention: RetrievalHead, DuoAttention, RazorAttention
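
For intuition on point 1, here is a toy sketch of a vertical-slash style mask. This is illustrative Python, not the fused kernel MInference uses; `k_vertical` / `k_slash` are hypothetical budget parameters, and the attention estimate would in practice come from a small sample of queries:

```python
import torch

def vertical_slash_mask(attn_est, k_vertical=8, k_slash=8):
    # attn_est: (seq_q, seq_k) attention estimate.
    seq_q, seq_k = attn_est.shape
    mask = torch.zeros(seq_q, seq_k, dtype=torch.bool)

    # Vertical lines: key columns with the highest total attention mass.
    col_scores = attn_est.sum(dim=0)
    v_idx = col_scores.topk(min(k_vertical, seq_k)).indices
    mask[:, v_idx] = True

    # Slash lines: diagonals (fixed query-key offsets) with the highest mass.
    offsets = torch.arange(seq_k).view(1, -1) - torch.arange(seq_q).view(-1, 1)
    diag_scores = torch.zeros(seq_q + seq_k - 1)
    diag_scores.index_add_(0, (offsets + seq_q - 1).flatten(), attn_est.flatten())
    top = diag_scores.topk(min(k_slash, diag_scores.numel())).indices - (seq_q - 1)
    for off in top.tolist():
        mask |= offsets == off
    return mask  # True = keep this (query, key) entry

attn = torch.softmax(torch.randn(64, 64), dim=-1)
m = vertical_slash_mask(attn, k_vertical=4, k_slash=4)
print(f"kept {m.float().mean():.1%} of entries")
```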

You can also evaluate the impact of the small sparse attention weight values in different heads from an end-to-end perspective, i.e., ablate them and measure the effect on a downstream metric, to gauge their importance.
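
As one concrete (and simplified) way to do this, here is a sketch of a head-ablation probe, assuming a Hugging Face causal LM whose forward accepts `head_mask` (e.g. GPT-2); the model name and the reporting threshold are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(name)
# "eager" attention, since head_mask is applied on the eager path
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager").eval()

inputs = tok("Paris is the capital of France.", return_tensors="pt")
n_layers, n_heads = model.config.n_layer, model.config.n_head

with torch.no_grad():
    base = model(**inputs, labels=inputs["input_ids"]).loss.item()
    for layer in range(n_layers):
        for head in range(n_heads):
            head_mask = torch.ones(n_layers, n_heads)
            head_mask[layer, head] = 0.0  # zero out one head end-to-end
            loss = model(**inputs, labels=inputs["input_ids"],
                         head_mask=head_mask).loss.item()
            if loss - base > 0.1:  # arbitrary threshold for illustration
                print(f"layer {layer} head {head}: Δloss = {loss - base:+.3f}")
```

Heads whose ablation barely moves the loss are candidates for higher sparsity; heads with a large Δloss are the ones worth a larger budget.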

I hope this helps!