Closed BirdChristopher closed 17 hours ago
Hi,
The threshold depends on the ratio of streaming heads (sparsity) you plan to use in your experiments. For instance, if you're aiming for 50% streaming heads, the threshold should correspond to the median of the gated values. You can find more details in this section of the code: duo_attn/utils.py, line 360.
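As a rough illustration of the idea above (not the code in duo_attn/utils.py), the threshold can be read off as a quantile of the gate values. This is a minimal sketch: the function names `threshold_for_sparsity` and `split_heads` and the example gate values are hypothetical, and it assumes heads whose gate value falls below the threshold are treated as streaming heads.

```python
import numpy as np

def threshold_for_sparsity(gate_values, sparsity):
    """Return the gate value below which a `sparsity` fraction of heads fall.

    Hypothetical helper: `sparsity` is the target fraction of streaming heads,
    so sparsity=0.5 yields the median of the gate values.
    """
    flat = np.asarray(gate_values, dtype=float).ravel()
    return float(np.quantile(flat, sparsity))

def split_heads(gate_values, sparsity):
    """Return (threshold, full_mask); full_mask is True for full-attention heads.

    Assumption: heads with a gate value below the threshold become streaming heads.
    """
    gates = np.asarray(gate_values, dtype=float)
    threshold = threshold_for_sparsity(gates, sparsity)
    full_mask = gates >= threshold
    return threshold, full_mask

# Made-up gate values for 2 layers x 4 KV heads (illustration only).
gates = np.array([[0.1, 0.9, 0.4, 0.7],
                  [0.2, 0.8, 0.3, 0.6]])
threshold, full_mask = split_heads(gates, sparsity=0.5)
streaming_fraction = 1.0 - full_mask.mean()
print(threshold, streaming_fraction)
```

With tied gate values the realized streaming fraction can differ slightly from the target, so it may be worth checking `streaming_fraction` after thresholding.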
Guangxuan
Incredible work! It is an amazing result that your work successfully demonstrates that "full attention" models can be deployed as "full attention + streaming attention" models with negligible quality loss. I'm curious how many KV heads were identified as streaming heads in your experiments. Could you please share the number?