Hello @lucidrains, thanks for generously sharing this implementation. According to Figure 2 of the original paper, a rel_pos_bias(q, k) term is added when computing the final attention weights. I can find this function in your FLASH class, but the operation seems to be missing from GAU. Could you clarify this point, or is the bias simply unnecessary in GAU?
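For concreteness, here is a rough sketch of where I would expect the bias to enter the GAU attention. This is my own illustration, not code from this repo: it assumes a simple learned scalar per relative offset (the paper's actual rel_pos_bias may be parameterized differently), and the names `SimpleRelPosBias` and the toy shapes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleRelPosBias(nn.Module):
    """Hypothetical sketch: one learnable scalar per relative offset
    in [-(L-1), L-1], added to the pre-activation attention logits."""
    def __init__(self, max_seq_len):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(2 * max_seq_len - 1))

    def forward(self, seq_len):
        pos = torch.arange(seq_len, device=self.bias.device)
        # rel[i, j] = i - j, shifted into [0, 2 * seq_len - 2] for indexing
        rel = pos[:, None] - pos[None, :] + seq_len - 1
        return self.bias[rel]  # (seq_len, seq_len)

# where I would expect it in a GAU-style attention step:
q = k = torch.randn(1, 64, 128)            # (batch, seq_len, dim), toy shapes
rel_bias = SimpleRelPosBias(max_seq_len=512)
sim = torch.einsum('b i d, b j d -> b i j', q, k) / q.shape[1]
attn = F.relu(sim + rel_bias(q.shape[1])) ** 2  # relu^2 attention as in Figure 2
```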
Thanks!