CharvinMei closed 3 days ago
Hi. Thanks for your question.
560 is the total number of blocks (attention and MLP counted separately) across the cached steps. For example, if 10 steps are cached, we have 10 (steps) × 28 (layers) × 2 (one attention block and one MLP block per layer) = 560. 248 is the number of those blocks that are removed.
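The arithmetic above can be checked with a few lines of Python. This is only a sketch of the counting described in this reply: the variable names are made up for illustration, and the 10 cached steps are the example value given here, not a setting confirmed from the code.

```python
# Illustrative names only; the numbers come from the reply above.
cached_steps = 10        # example number of cached steps
layers = 28              # transformer layers
blocks_per_layer = 2     # one attention block + one MLP block

total_blocks = cached_steps * layers * blocks_per_layer
removed_blocks = 248     # blocks removed, as reported in Table 9

remove_ratio = removed_blocks / total_blocks

print(total_blocks)            # 560
print(f"{remove_ratio:.1%}")   # 44.3%
```

So the Remove Ratio 248/560 counts individual attention and MLP blocks over all cached steps, which works out to roughly 44% of them.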
Thank you for your answer.
I think this is a good article. However, I have a question for you. What does the Remove Ratio 248/560 mentioned in Table 9 refer to? Does it mean the number of blocks removed, or the number of attention or MLP layers removed within a single block?