mit-han-lab / duo-attention

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Curious about the threshold tau selected for the experiments #5

Closed · BirdChristopher closed this 17 hours ago

BirdChristopher commented 2 days ago

Incredible work! It's an amazing result that you've shown "full attention" models can be deployed as "full attention + streaming attention" models with negligible quality loss. I'm curious how many KV heads were identified as streaming heads in your experiments. Could you please share the number?

Guangxuan-Xiao commented 1 day ago

Hi,

The threshold depends on the ratio of streaming heads (the sparsity) you plan to use in your experiments. For instance, if you're aiming for 50% streaming heads, the threshold should correspond to the median of the learned gate values. You can find more details in this section of the code: duo_attn/utils.py, line 360.
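
For illustration, here is a minimal sketch of that quantile-based selection. The function name, tensor shape, and `sparsity` parameter are assumptions made for this example, not the repository's actual API; see duo_attn/utils.py for the real implementation.

```python
import torch

def select_streaming_heads(gate_values: torch.Tensor, sparsity: float = 0.5):
    """Pick streaming heads by thresholding learned per-head gate values.

    gate_values: assumed shape [num_layers, num_kv_heads]; lower values
                 indicate heads that behave like streaming heads.
    sparsity:    target fraction of streaming heads.
    """
    # tau is the sparsity-quantile of all gates; with sparsity=0.5 this is
    # the median, so half of the heads fall below it and become streaming.
    tau = torch.quantile(gate_values.float().flatten(), sparsity)
    streaming_mask = gate_values < tau  # True = streaming head
    return tau, streaming_mask

# Toy usage: random gates for a 32-layer model with 8 KV heads per layer.
gates = torch.rand(32, 8)
tau, mask = select_streaming_heads(gates, sparsity=0.5)
print(f"tau = {tau.item():.3f}, streaming heads: {mask.sum().item()}/{mask.numel()}")
```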

Guangxuan