microsoft / MInference

To speed up long-context LLM inference, MInference computes attention with approximate and dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License

[Question]: Question about the settings of vertical_size and slash_size in vertical_and_slash pattern #47

Open ALUKErnel opened 1 month ago

ALUKErnel commented 1 month ago

Describe the issue

Thanks for the great work!

Based on my own implementation, I have some questions about the settings of vertical_size and slash_size. It seems that larger vertical_size and slash_size values do not guarantee better performance (i.e., PPL in my experiments). Intuitively, as the vertical and slash sizes increase, more weights of the attention matrix are retained (along with the corresponding KV cache), so performance should improve. However, my experimental results sometimes contradict this. There also seems to be a trade-off between v_size and s_size; in my experiments, s_size has the larger impact on performance.

In your empirical experiments exploring v_size and s_size settings (e.g., (30, 800), (500, 700), (1000, 6096), ...), does performance improve as v_size and s_size increase, or is there some other pattern?

Looking forward to your reply!

ALUKErnel commented 1 month ago

Supplementary details (if needed): I ran the experiments on Llama-2-7B with a sequence length of 4k and last_q = 64 (inference). The metric is PPL on PG19. The experiments aim to explore only the impact of vertical_size and slash_size on performance (without considering efficiency for now).
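
For concreteness, here is a minimal sketch of how I understand the vertical_and_slash pattern in my setup: attention scores from the last last_q queries are used to pick the top vertical_size key columns and top slash_size diagonals, and only those positions (plus causality) are kept. This is my own illustration, not the MInference kernel:

```python
import torch

def vertical_and_slash_mask(q, k, vertical_size, slash_size, last_q=64):
    """Illustrative vertical_and_slash mask for a single head.

    q, k: (seq_len, head_dim). Scores of the last `last_q` queries are used
    to estimate which vertical columns (key positions) and slash diagonals
    (fixed query-key offsets) carry the most attention mass.
    """
    seq_len, head_dim = q.shape
    q_pos = torch.arange(seq_len - last_q, seq_len)
    k_pos = torch.arange(seq_len)
    causal_est = k_pos[None, :] <= q_pos[:, None]            # (last_q, seq_len)

    est = (q[-last_q:] @ k.T) / head_dim ** 0.5
    est = est.masked_fill(~causal_est, float("-inf")).softmax(dim=-1).float()

    # Vertical pattern: key columns with the largest summed estimated score.
    col_score = est.sum(dim=0)                                # (seq_len,)
    top_cols = col_score.topk(min(vertical_size, seq_len)).indices

    # Slash pattern: diagonals (offset = query_pos - key_pos) with the largest score.
    offsets = q_pos[:, None] - k_pos[None, :]                 # (last_q, seq_len)
    diag_score = torch.zeros(seq_len)
    diag_score.scatter_add_(0, offsets[causal_est], est[causal_est])
    top_diags = diag_score.topk(min(slash_size, seq_len)).indices

    # Final mask: causal AND (selected column OR selected diagonal).
    rows = torch.arange(seq_len)[:, None]
    cols = torch.arange(seq_len)[None, :]
    keep_cols = torch.zeros(seq_len, dtype=torch.bool)
    keep_cols[top_cols] = True
    keep_diags = torch.zeros(seq_len, dtype=torch.bool)
    keep_diags[top_diags] = True
    causal = cols <= rows
    return causal & (keep_cols[None, :] | keep_diags[(rows - cols).clamp(min=0)])
```

I then sweep (vertical_size, slash_size) pairs with this mask applied and measure PPL.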

iofu728 commented 1 month ago

Hi @ALUKErnel, thanks for your great question.

  1. Generally speaking, different heads have varying sensitivity to vertical size and slash size. Some heads, such as the one with the config (3500, 100), require a larger vertical size rather than a larger slash size (see the per-head sketch after this list).
  2. PPL in long-context scenarios is not an effective indicator: it depends almost entirely on the local window. This is why the StreamingLLM method shows such good results on PPL tests. For downstream tasks, I would recommend KV retrieval or Needle In A Haystack (although it is simple, it reflects the model's capability across different context windows and depths).
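
On point 1, here is a hedged sketch of what per-head settings can look like, reusing the illustrative vertical_and_slash_mask helper from the sketch above. The head indices and size pairs are made up for illustration, and this is not MInference's actual config format:

```python
# Illustrative per-head settings: head index -> (vertical_size, slash_size).
# MInference searches the best pattern and sizes per head offline and stores
# them in a config file; the numbers below are only examples.
HEAD_CONFIG = {
    0: (3500, 100),   # head that relies mostly on vertical (column) attention
    1: (100, 800),    # head that relies mostly on slash (diagonal) attention
    2: (500, 700),
}

def build_per_head_masks(q, k, head_config, last_q=64):
    """q, k: (num_heads, seq_len, head_dim); returns a bool mask per head."""
    return {
        h: vertical_and_slash_mask(q[h], k[h], v_size, s_size, last_q)
        for h, (v_size, s_size) in head_config.items()
    }
```
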
ALUKErnel commented 1 month ago

Thanks for your response! : ) I am also wondering whether the y-axis in Figure 5 represents the log of perplexity (i.e., e^{8-10}) or the actual perplexity values (i.e., 8-10)?

[Screenshot 2024-07-18 11:28:53]
iofu728 commented 1 month ago

> I am also wondering whether the y-axis in Figure 5 represents the log of perplexity (i.e., e^{8-10}) or the actual perplexity values (i.e., 8-10)?

Hi @ALUKErnel, the PPL results are reported after taking the exp, i.e., they are actual perplexity values. You can refer to this code: https://github.com/microsoft/MInference/blob/main/experiments/ppl/run_ppl.py#L138.
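
In other words, perplexity here is exp of the average token-level cross-entropy, roughly as in the simplified sketch below (a generic illustration, not a copy of the linked run_ppl.py):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, input_ids):
    """Perplexity = exp(mean token-level negative log-likelihood).

    `model` is assumed to be a Hugging Face-style causal LM whose output
    has a `.logits` field of shape (batch, seq_len, vocab_size).
    """
    logits = model(input_ids).logits
    # Predict token t from tokens < t: shift logits left, labels right.
    nll = F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    return torch.exp(nll)  # values around 8-10, as in Figure 5, are already exp'd
```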