lucidrains / performer-pytorch

An implementation of Performer, a linear attention-based transformer, in Pytorch
MIT License
1.07k stars · 143 forks

Performer Pytorch Slower than Expected and Please Help with Understanding Parameter Count #92

Open michaelweihaosong opened 1 year ago

michaelweihaosong commented 1 year ago

Hi,

First of all, this is a great package from lucidrains and I find it very helpful in my research.

A quick question: I noticed that the ViT-Performer is slower than the regular ViT from lucidrains. For example, training on MNIST (from PyTorch) takes 15 sec/epoch for the regular ViT with the configuration below, while the ViT-Performer takes 23 sec/epoch.

Checking the parameter count also shows that the ViT-Performer has double the parameter count of the regular ViT.

(Screenshots: model configurations and parameter counts, captured 2022-12-12)

I am hoping someone can offer intuition about the speed of the ViT-Performer vs. the regular ViT, and about their parameter counts.

Thank you very much in advance!

michaelweihaosong commented 1 year ago

Just found out why the model is twice as big: the feed-forward layer uses a dimension multiplier of 4 by default. After setting ff_mult=1, the two models are the same size.

(Screenshot: matching parameter counts after setting ff_mult=1, captured 2022-12-13)

However, the Performer is still slower than the regular ViT when training on the torchvision.datasets.MNIST training set on an RTX 3090:

Regular ViT — average seconds to train 1 epoch: 15.101385951042175; average seconds for testing: 0.6326647281646729

Performer ViT — average seconds to train 1 epoch: 28.795904541015624; average seconds for testing: 0.9286866903305053
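One likely explanation (my understanding, not confirmed for this repo): Performer's FAVOR+ attention is linear in sequence length n, but pays a constant-factor cost for projecting queries and keys onto m random features, so its matmul cost is roughly O(n·m·d) versus O(n²·d) for softmax attention. An MNIST ViT has very few tokens (e.g. ~49 patches for a 7×7 grid), so the quadratic path is actually cheaper. A back-of-the-envelope sketch, with hypothetical head dimension and feature count:

```python
# Dominant matmul cost per attention layer, ignoring constant factors:
# softmax attention:  Q @ K^T and attn @ V        -> ~2 * n^2 * d
# Performer (FAVOR+): two linear-size matmuls via m random features
#                                                 -> ~2 * n * m * d
def softmax_attn_flops(n: int, d: int) -> int:
    return 2 * n * n * d

def performer_attn_flops(n: int, d: int, m: int) -> int:
    return 2 * n * m * d

n_mnist = 50      # ~49 patch tokens + CLS token (hypothetical MNIST ViT)
d, m = 64, 256    # hypothetical head dim and number of random features

print(softmax_attn_flops(n_mnist, d))       # quadratic term is tiny at n=50
print(performer_attn_flops(n_mnist, d, m))  # linear term is dominated by m
```

Under these assumptions the crossover only happens once n exceeds roughly m, so the Performer's speed advantage would show up at much longer sequence lengths than an MNIST ViT ever reaches; the extra random-feature machinery just adds overhead here.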