lucidrains / FLASH-pytorch

Implementation of the Transformer variant proposed in "Transformer Quality in Linear Time"
MIT License

rel_pos_bias in GAU #9

Open SunderlandAJ-1130 opened 1 year ago

SunderlandAJ-1130 commented 1 year ago

Hello @lucidrains, thank you for generously sharing this implementation. According to Figure 2 of the original paper, a rel_pos_bias(q, k) term is added when computing the final attention weights. I can find this function in your FLASH class, but the operation seems to be missing from GAU. Could you explain why, or is this operation simply unnecessary in GAU?

Thanks!
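For context, my reading of Figure 2 is that the bias enters the pre-activation attention logits roughly like this (a generic sketch of the paper's squared-ReLU attention, not code from this repo; names and the omitted scaling are my own):

```python
import torch
import torch.nn.functional as F

def attention_with_rel_pos_bias(q, k, v, rel_pos_bias):
    # q, k, v: (batch, seq_len, dim); rel_pos_bias: (seq_len, seq_len) learned bias
    sim = torch.einsum('b i d, b j d -> b i j', q, k)      # similarity logits
    attn = F.relu(sim + rel_pos_bias) ** 2                  # squared ReLU attention, bias added pre-activation
    return torch.einsum('b i j, b j d -> b i d', attn, v)
```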

lucidrains commented 1 year ago

@SunderlandAJ-1130 yea no problem

do you want to try setting this keyword argument to True
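i.e. something along these lines (a minimal sketch; I'm assuming the keyword is named rel_pos_bias and that the other arguments match the README, so please check the current GAU signature):

```python
import torch
from flash_pytorch import GAU

# assumed keyword: rel_pos_bias enables the relative position bias from Figure 2 of the paper
gau = GAU(
    dim = 512,
    query_key_dim = 128,
    causal = True,
    rel_pos_bias = True
)

x = torch.randn(1, 1024, 512)
out = gau(x)  # (1, 1024, 512)
```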