FilipAndersson245 opened 3 years ago
I think this is also quite interesting. I'd recommend building the hybrid model implementation for a better speed/accuracy trade-off:
"we found that adding self-attention sublayers to FNet models offers a simple way to trade off speed for accuracy... specifically replacing the final two Fourier sublayers of FNet with self-attention layers yielded a model that achieved 97% of BERT accuracy, but pre-trained six times as fast on GPUs..."
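A minimal NumPy sketch of the hybrid layout described in the quote: Fourier token mixing in all but the final two layers, self-attention in the last two. Layer norms, feed-forward sublayers, and learned projections are omitted, and the names (`hybrid_encoder`, the toy identity-projection attention) are illustrative, not from the FNet codebase.

```python
import numpy as np

def fourier_mixing(x):
    """FNet-style token mixing: 2D FFT over the sequence and hidden
    dimensions, keeping only the real part (no learned parameters)."""
    return np.fft.fft2(x).real

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    """Toy single-head self-attention with identity projections,
    just to show where attention sits in the hybrid stack."""
    scores = softmax(x @ x.T / np.sqrt(x.shape[-1]))
    return scores @ x

def hybrid_encoder(x, n_layers=6):
    """Fourier mixing in the first n_layers-2 blocks,
    self-attention in the final two, per the quoted trade-off."""
    for i in range(n_layers):
        mix = self_attention if i >= n_layers - 2 else fourier_mixing
        x = x + mix(x)  # residual connection; norm/FFN sublayers omitted
    return x

tokens = np.random.default_rng(0).normal(size=(8, 16))  # (seq_len, hidden)
out = hybrid_encoder(tokens)
print(out.shape)  # (8, 16)
```

The Fourier blocks are parameter-free and FFT-based, which is where the speed-up comes from; only the last two blocks pay the quadratic attention cost.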
And go with the NesT transformer as the backbone instead of ViT :)
Arxiv, Yannic — The authors propose replacing attention with Fourier transforms in BERT; this improves speed immensely (~6x) with only a minor loss in predictive performance. Maybe it would be interesting to examine whether it can be integrated into ViT.