microsoft / MInference

[NeurIPS'24 Spotlight] MInference speeds up long-context LLM inference by computing attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License

[Question]: analysis of attention scores (too sparse) #82

Open wiluen opened 1 month ago

wiluen commented 1 month ago

Describe the issue

I want to ask a general question. When analyzing attention scores, I find that mine are quite sparse and their values are very low, so I cannot extract any useful signal, such as which kinds of tokens receive more attention. Given that a model has n layers and m attention heads, how can I gain some useful insights? My task is to extract important information from the input I provide.
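
For context, this is roughly how I collect the scores (a minimal sketch with a Hugging Face model; `gpt2` here is only a placeholder for the model I actually use):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(name)
# "eager" attention so that attention weights are actually materialized
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager")

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: tuple of n_layers tensors, each (batch, n_heads, seq, seq)
for layer, attn in enumerate(out.attentions):
    probs = attn.clamp_min(1e-9)
    # Low entropy = the head concentrates its mass on a few tokens.
    entropy = -(probs * probs.log()).sum(-1).mean(dim=(-1, 0))  # (n_heads,)
    top_keys = attn.mean(dim=(0, 1, 2)).topk(3).indices  # most-attended positions
    print(f"layer {layer}: head entropies {entropy.tolist()}, "
          f"top attended positions {top_keys.tolist()}")
```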

wiluen commented 1 month ago

Do different attention heads / different layers matter?

iofu728 commented 1 month ago

Hi @wiluen, thanks for your question.

If I understand correctly, you're asking how to determine which parts of the attention weights are more important to preserve, especially in highly sparse scenarios.

  1. In MInference, we don't perform fine-grained per-head adjustments; most heads use the same kernel sparsity rate. However, we replace block sparsity with a higher-budget vertical-slash (VS) pattern for certain heads, since we found that allocating more budget to these heads significantly improves performance (a toy sketch of such a mask follows this list).

  2. There are several related works exploring this direction, including:

    • KV cache compression: PyramidKV, RetrievalAttention
    • Sparse Attention: RetrievalHead, DuoAttention, RazorAttention
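
For intuition on point 1, here is a toy sketch of a vertical-slash style mask. This is illustrative Python, not the fused kernel MInference uses; `k_vertical` / `k_slash` are hypothetical budget parameters, and the attention estimate would in practice come from a small sample of queries:

```python
import torch

def vertical_slash_mask(attn_est, k_vertical=8, k_slash=8):
    # attn_est: (seq_q, seq_k) attention estimate.
    seq_q, seq_k = attn_est.shape
    mask = torch.zeros(seq_q, seq_k, dtype=torch.bool)

    # Vertical lines: key columns with the highest total attention mass.
    col_scores = attn_est.sum(dim=0)
    v_idx = col_scores.topk(min(k_vertical, seq_k)).indices
    mask[:, v_idx] = True

    # Slash lines: diagonals (fixed query-key offsets) with the highest mass.
    offsets = torch.arange(seq_k).view(1, -1) - torch.arange(seq_q).view(-1, 1)
    diag_scores = torch.zeros(seq_q + seq_k - 1)
    diag_scores.index_add_(0, (offsets + seq_q - 1).flatten(), attn_est.flatten())
    top = diag_scores.topk(min(k_slash, diag_scores.numel())).indices - (seq_q - 1)
    for off in top.tolist():
        mask |= offsets == off
    return mask  # True = keep this (query, key) entry

attn = torch.softmax(torch.randn(64, 64), dim=-1)
m = vertical_slash_mask(attn, k_vertical=4, k_slash=4)
print(f"kept {m.float().mean():.1%} of entries")
```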

You can also evaluate the impact of the small sparse attention weight values in different heads from an end-to-end perspective, i.e., ablate them and measure the effect on a downstream metric, to gauge their importance.
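
As one concrete (and simplified) way to do this, here is a sketch of a head-ablation probe, assuming a Hugging Face causal LM whose forward accepts `head_mask` (e.g. GPT-2); the model name and the reporting threshold are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(name)
# "eager" attention, since head_mask is applied on the eager path
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager").eval()

inputs = tok("Paris is the capital of France.", return_tensors="pt")
n_layers, n_heads = model.config.n_layer, model.config.n_head

with torch.no_grad():
    base = model(**inputs, labels=inputs["input_ids"]).loss.item()
    for layer in range(n_layers):
        for head in range(n_heads):
            head_mask = torch.ones(n_layers, n_heads)
            head_mask[layer, head] = 0.0  # zero out one head end-to-end
            loss = model(**inputs, labels=inputs["input_ids"],
                         head_mask=head_mask).loss.item()
            if loss - base > 0.1:  # arbitrary threshold for illustration
                print(f"layer {layer} head {head}: Δloss = {loss - base:+.3f}")
```

Heads whose ablation barely moves the loss are candidates for higher sparsity; heads with a large Δloss are the ones worth a larger budget.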

I hope this helps!