thu-ml / SageAttention

Quantized attention that achieves speedups of 2.1x and 2.7x over FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
BSD 3-Clause "New" or "Revised" License

Would you support other head dims? #17

Open v4if opened 3 weeks ago

v4if commented 3 weeks ago

Would you support other head dims, like 512?

jason-huang03 commented 3 weeks ago

In the upcoming release, we will include kernels for head dimension of 256. However, models with a head dimension of 256 are already quite rare (only gemma-2 2b/9b as far as I know), and those with 512 are even more uncommon. Could you provide examples of models that use a head dimension of 512? This would give us a stronger incentive to optimize this type of kernel.
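For context, here is a minimal sketch of how a caller might guard on head dimension today: route to the SageAttention kernel when the head dim is in the supported set and fall back to PyTorch's scaled_dot_product_attention otherwise. The supported-dimension set and the `sageattn` call below are assumptions based on the project README, not something stated in this thread; check the installed version before relying on them.

```python
# Hedged sketch: dispatch to SageAttention only for head dims its kernels cover,
# otherwise fall back to the standard PyTorch attention kernel.
import torch
import torch.nn.functional as F

try:
    from sageattention import sageattn  # signature assumed from the README
    _HAS_SAGE = True
except ImportError:
    _HAS_SAGE = False

# Assumed supported set; head dim 256 is described above as upcoming.
SUPPORTED_HEAD_DIMS = {64, 96, 128}


def attention(q, k, v, is_causal=False):
    """q, k, v: (batch, num_heads, seq_len, head_dim) tensors."""
    head_dim = q.shape[-1]
    if _HAS_SAGE and head_dim in SUPPORTED_HEAD_DIMS:
        return sageattn(q, k, v, is_causal=is_causal)
    # Head dims such as 256 or 512 fall back to the built-in kernel.
    return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)
```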

v4if commented 3 weeks ago

Thanks. It's an internal model.