ALUKErnel opened this issue 1 month ago
Supplements (if necessary): I conducted the experiments on llama2-7B, with sequence length 4k and last_q 64 (inference). The metric is ppl on PG19. The experiments aim to explore only the impact of vertical_size and slash_size on performance (without considering efficiency for now).
Hi @ALUKErnel, thanks for your great question.
Thanks for your response! :) I am also wondering whether the y-axis in Figure 5 represents the log of perplexity (i.e., the true PPL would be e^{8-10}) or the actual perplexity values (i.e., 8-10)?
Hi @ALUKErnel, the PPL results are after exp, i.e., the y-axis shows actual perplexity values. You can refer to this code: https://github.com/microsoft/MInference/blob/main/experiments/ppl/run_ppl.py#L138.
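For readers skimming the thread, a minimal sketch of that relationship (the per-token losses below are made up for illustration):

```python
import torch

# Perplexity is exp of the mean per-token negative log-likelihood,
# so a y-axis "after exp" shows actual PPL values such as 8-10,
# not log-PPL. The losses here are hypothetical.
nll = torch.tensor([2.1, 2.3, 2.0])  # per-token cross-entropy losses
ppl = torch.exp(nll.mean())
print(ppl.item())  # ~8.4, i.e., e^{2.13}
```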
Describe the issue
Thanks for the great work!
Based on my own implementation, I have some questions about the settings of vertical_size and slash_size. It seems that larger vertical_size and slash_size do not guarantee better performance (i.e., lower ppl in my experiments). Intuitively, as the vertical and slash sizes increase, more weights in the attention matrix are retained (along with the corresponding KV cache), so performance should improve. However, my experimental results sometimes contradict this. There also seems to be a trade-off between v_size and s_size; in my experiments, s_size has a larger impact on performance.
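To make the setting concrete, here is a toy sketch of the pattern I am assuming (my own illustrative code, not MInference's actual kernel; `vertical_slash_mask` and the sum-based scoring heuristic are assumptions):

```python
import torch

def vertical_slash_mask(attn_est, vertical_size, slash_size):
    """Toy mask: keep the top-`vertical_size` columns (vertical lines)
    and top-`slash_size` diagonals (slash lines) of a (q_len, k_len)
    attention estimate, e.g. one built from the last_q queries."""
    q_len, k_len = attn_est.shape
    # Score each column by its total estimated attention mass.
    v_idx = attn_est.sum(dim=0).topk(min(vertical_size, k_len)).indices
    # Score each diagonal (offset = key_pos - query_pos) the same way.
    offsets = torch.arange(k_len) - torch.arange(q_len)[:, None]
    diag_scores = torch.zeros(q_len + k_len - 1)
    diag_scores.scatter_add_(0, (offsets + q_len - 1).flatten(),
                             attn_est.flatten())
    top_offsets = (diag_scores
                   .topk(min(slash_size, diag_scores.numel()))
                   .indices - (q_len - 1))
    mask = torch.zeros(q_len, k_len, dtype=torch.bool)
    mask[:, v_idx] = True                      # vertical lines
    for off in top_offsets.tolist():           # slash lines
        mask |= offsets == off
    return mask
```

Under this picture, raising vertical_size/slash_size only helps if the newly kept lines actually carry attention mass; otherwise the extra budget goes to near-zero entries, which is consistent with the non-monotonic ppl I observe.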
I wonder, in your empirical experiments exploring settings of v_size and s_size (e.g., (30, 800), (500, 700), (1000, 6096), ...), does performance improve as v_size and s_size increase, or is there some other specific pattern?
Looking forward to your reply!