microsoft / MInference

To speed up long-context LLMs' inference, MInference computes attention with approximate and dynamic sparse methods, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License
697 stars 25 forks

[Question]: Confusion about Optimal Search Pattern Configuration #64

Open Dianaia opened 1 month ago

Dianaia commented 1 month ago

Confusion about Optimal Search Pattern Configuration

First of all, thank you for your outstanding research. I noticed that in Appendix E of the paper, it is mentioned that "according to the ablation study, using only the Vertical-Slash pattern significantly impacts performance in highly dynamic tasks like KV retrieval." However, the model configuration provided in the repository still uses the Vertical-Slash pattern exclusively. You mentioned in other comments that "the search_pattern function reroutes to vertical_and_slash because our tests have shown that this setting offers better generalization and efficiency across different context windows and tasks." This seems to contradict the conclusion given in the paper, which leaves me somewhat confused. Could you please clarify how we should set the optimal search pattern in practice?

iofu728 commented 1 month ago

Hi @Dianaia,

Thanks for your feedback and great question.

Actually, there's no contradiction.

  1. As shown in Figure 11, the majority (>90%) of the patterns we found through our search are "vertical and slash" patterns.
  2. As shown in Table 4, using only the "vertical and slash" pattern results in minimal performance differences across most tasks. The most crucial aspect here is the slash pattern.
  3. Based on our tests, using only the "vertical and slash" pattern and fine-tuning some compression ratios (e.g., increasing the number of slash lines in certain heads) can lead to better generalization across different context windows and tasks (see the sketch after this list).
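
To make points 2 and 3 concrete, here is a rough, framework-agnostic sketch of what a "vertical and slash" sparse mask looks like for a single attention head. The function name and the `vertical_idx` / `slash_offsets` parameters are hypothetical illustration only, not MInference's actual kernels or config keys:

```python
import torch

def vertical_and_slash_mask(seq_len, vertical_idx, slash_offsets):
    """Toy boolean attention mask for one head.

    vertical_idx:  key positions (columns) that every query may attend to.
    slash_offsets: diagonal offsets (query_pos - key_pos) that are kept,
                   e.g. 0 is the main diagonal, 1 the previous token, etc.
    """
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    mask[:, vertical_idx] = True                  # vertical lines
    for off in slash_offsets:                     # slash (diagonal) lines
        rows = torch.arange(off, seq_len)
        mask[rows, rows - off] = True
    causal = torch.tril(torch.ones(seq_len, seq_len)).bool()
    return mask & causal                          # keep it causal

# Example: attend to the first 4 "sink" tokens plus the last 3 diagonals.
m = vertical_and_slash_mask(16, vertical_idx=torch.arange(4), slash_offsets=[0, 1, 2])
print(m.int())
```

Increasing the per-head budgets (more vertical columns or more slash lines) in this toy picture corresponds to the compression-ratio tuning mentioned in point 3.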

I recommend following our instructions in practical use and employing the "vertical and slash" pattern exclusively. Our tests have shown that this approach performs well across different models, model sizes, and tasks.
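
In practice, that means the default patching flow is enough; the following sketch roughly follows the repo README (the model name is just one of the supported checkpoints, and exact arguments may differ in the version you have installed):

```python
from transformers import pipeline
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # any supported model
pipe = pipeline("text-generation", model=model_name,
                torch_dtype="auto", device_map="auto")

# Patch the HF model in place: "minference" selects the dynamic sparse
# attention path, and the bundled best-pattern config (vertical and slash)
# is picked up by model name.
minference_patch = MInference("minference", model_name)
pipe.model = minference_patch(pipe.model)

out = pipe("Summarize the following document: ...", max_new_tokens=64)
print(out)
```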

Dianaia commented 1 month ago

Got it, I understand now. Thank you again for your explanation and outstanding work.