lucidrains / FLASH-pytorch

Implementation of the Transformer variant proposed in "Transformer Quality in Linear Time"
MIT License

Cross-Attention? #4

Open amorehead opened 2 years ago

amorehead commented 2 years ago

Hi, @lucidrains. Thank you for sharing this excellent implementation with us all! Do you have any thoughts as to what changes would need to be made to make cross-attention possible with your FLASH model?

lucidrains commented 2 years ago

@amorehead hey Alex! The GAU module could be made to support cross-attention, but not the full FLASH transformer. The FLASH transformer design is very specific to autoregressive training.
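
For anyone landing here: below is a minimal sketch (not part of this repo) of what a GAU-style block with a separate `context` input could look like. The class name `CrossGAU`, the separate `to_q` / `to_k` projections, and the `context` argument are all assumptions for illustration; the actual `GAU` in `flash_pytorch` would need its own modification, and the paper's offset-scale trick for deriving q/k from a shared projection is omitted here for simplicity.

```python
import torch
import torch.nn.functional as F
from torch import nn, einsum

class CrossGAU(nn.Module):
    # hypothetical cross-attention variant of a gated attention unit
    def __init__(self, dim, query_key_dim = 128, expansion_factor = 2.):
        super().__init__()
        hidden_dim = int(dim * expansion_factor)

        self.norm = nn.LayerNorm(dim)
        self.norm_context = nn.LayerNorm(dim)

        # gate u comes from the query side, value v from the context side
        self.to_gate = nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU())
        self.to_value = nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU())

        # queries from x, keys from context, both in a small shared dim
        self.to_q = nn.Linear(dim, query_key_dim)
        self.to_k = nn.Linear(dim, query_key_dim)

        self.to_out = nn.Linear(hidden_dim, dim)

    def forward(self, x, context):
        x, context = self.norm(x), self.norm_context(context)

        u = self.to_gate(x)           # (b, n, hidden)
        v = self.to_value(context)    # (b, m, hidden)

        q = self.to_q(x)              # (b, n, qk_dim)
        k = self.to_k(context)        # (b, m, qk_dim)

        # relu^2 attention as in the paper, normalized by context length
        sim = einsum('b n d, b m d -> b n m', q, k)
        attn = F.relu(sim) ** 2 / context.shape[-2]

        out = einsum('b n m, b m e -> b n e', attn, v)
        return self.to_out(u * out)

# usage sketch
gau = CrossGAU(dim = 512)
x = torch.randn(1, 1024, 512)
context = torch.randn(1, 256, 512)
out = gau(x, context)  # (1, 1024, 512)
```

The key change from a self-attention GAU is that keys and values are projected from the context sequence while the queries and the gating branch come from `x`, so the attention map has shape `(n, m)` rather than `(n, n)`.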

Kite0011 commented 1 year ago

Hi @lucidrains! Do you mean I can just apply GAU in a cross-attention model such as T5? I found GAU works very well in a BERT model.