lucidrains / vit-pytorch

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch
MIT License

Add FNet as an optional implementation of ViT #123

Open FilipAndersson245 opened 3 years ago

FilipAndersson245 commented 3 years ago

Arxiv, Yannic. The authors propose that attention can be replaced with Fourier transforms in BERT; this improves speed immensely (~6x) with only a minor loss in predictive performance. Maybe it would be interesting to examine whether it can be integrated into ViT.
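
For concreteness, here is a minimal sketch of what such a Fourier mixing sublayer could look like as a drop-in replacement for attention in a pre-norm ViT-style block (`FNetBlock` and `FNetEncoderLayer` are hypothetical names, not existing vit-pytorch classes):

```python
import torch
from torch import nn

class FNetBlock(nn.Module):
    """FNet-style token mixing: replaces the self-attention sublayer
    with a parameter-free 2D Fourier transform."""
    def forward(self, x):
        # x: (batch, seq_len, dim). FFT along the hidden dim, then along
        # the sequence dim; keep only the real part, as in the FNet paper.
        return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

class FNetEncoderLayer(nn.Module):
    """Pre-norm transformer block with Fourier mixing in place of attention."""
    def __init__(self, dim, mlp_dim, dropout=0.):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mix = FNetBlock()
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_dim, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        x = x + self.mix(self.norm1(x))   # Fourier mixing sublayer (residual)
        return x + self.mlp(self.norm2(x))  # feed-forward sublayer (residual)
```

Usage would be e.g. `FNetEncoderLayer(dim=256, mlp_dim=512)(torch.randn(2, 65, 256))`. Note that the mixing sublayer has no learnable parameters, which is where the speedup comes from.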

lessw2020 commented 3 years ago

I think this is also quite interesting. I'd recommend building the hybrid model implementation for a better speed/accuracy tradeoff:

"we found that adding self-attention sublayers to FNet models offers a simple way to trade off speed for accuracy... specifically replacing the final two Fourier sublayers of FNet with self-attention layers yielded a model that acheived 97% of BERT accuracy, but pre-trained six times as fast on gpus..."

And go with the NesT transformer as the home instead of ViT :)
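
For what it's worth, a hybrid stack along the lines of the quote might look like this (a hypothetical sketch; `HybridEncoder` is not an existing vit-pytorch or NesT class, and the attention placement follows the "final two layers" recipe from the quoted paper):

```python
import torch
from torch import nn

class FNetBlock(nn.Module):
    # Parameter-free Fourier mixing, same as the sketch above.
    def forward(self, x):
        return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

class HybridEncoder(nn.Module):
    """Fourier mixing in every layer except the last `num_attn_layers`,
    which use standard self-attention, per the quoted speed/accuracy tradeoff."""
    def __init__(self, dim, depth, heads, mlp_dim, num_attn_layers=2):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(depth):
            # Only the final `num_attn_layers` layers get real attention.
            use_attn = i >= depth - num_attn_layers
            mixer = (nn.MultiheadAttention(dim, heads, batch_first=True)
                     if use_attn else FNetBlock())
            self.layers.append(nn.ModuleList([
                nn.LayerNorm(dim), mixer, nn.LayerNorm(dim),
                nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(),
                              nn.Linear(mlp_dim, dim)),
            ]))

    def forward(self, x):
        for norm1, mixer, norm2, mlp in self.layers:
            h = norm1(x)
            if isinstance(mixer, nn.MultiheadAttention):
                h, _ = mixer(h, h, h, need_weights=False)  # self-attention
            else:
                h = mixer(h)                               # Fourier mixing
            x = x + h
            x = x + mlp(norm2(x))
        return x

# e.g. HybridEncoder(dim=256, depth=6, heads=8, mlp_dim=512)(torch.randn(2, 65, 256))
```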