lucidrains / performer-pytorch

An implementation of Performer, a linear attention-based transformer, in PyTorch
MIT License

Question: Scaling down number of random features depending on number of heads? #4

Closed · Parskatt closed this issue 3 years ago

Parskatt commented 3 years ago

The theory in the paper gives a result with guarantees for nb_features = O(dim · log(dim)). When using multiple heads, e.g. dim = 512 and heads = 8, each head works with a lower dimensionality (dim_head = 64), so is it then reasonable to scale nb_features = O((dim/heads) · log(dim/heads))? Or does the variance get too high when the number of features becomes that small? Do you have any intuition for this? I'm feeling a bit unsure.
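
For concreteness, this is the scaling I have in mind (a minimal sketch; the helper name is mine, not from the library):

```python
import math

def nb_features_for(dim, heads):
    # dimensionality each head actually sees after the split
    dim_head = dim // heads
    # O(d * log d) random features per head, following the Performer theory (natural log)
    return int(dim_head * math.log(dim_head))

print(nb_features_for(512, 8))   # dim_head = 64  -> 64 * ln(64)   ~ 266
print(nb_features_for(1024, 8))  # dim_head = 128 -> 128 * ln(128) ~ 621
```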

lucidrains commented 3 years ago

@Parskatt I think it makes sense for this to be based on dim / heads. For a standard 1024 dim and 8 heads, dim_head = 128, so 128 * log(128) ≈ 621 (natural log).
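
As a usage sketch, assuming the SelfAttention module's nb_features keyword and using the dim = 512, heads = 8 example from the question (so dim_head = 64 and nb_features ≈ 64 · ln(64) ≈ 266):

```python
import torch
from performer_pytorch import SelfAttention

# dim = 512 split over 8 heads gives dim_head = 64,
# so nb_features is set to roughly 64 * ln(64) ~ 266
attn = SelfAttention(
    dim = 512,
    heads = 8,
    dim_head = 64,
    causal = False,
    nb_features = 266
)

x = torch.randn(1, 1024, 512)  # (batch, sequence, dim)
out = attn(x)                  # -> (1, 1024, 512)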

lucidrains commented 3 years ago

@Parskatt I was told this hyperparameter is pretty critical to good performance
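
One rough way to get a feel for that sensitivity (a sketch using the repo's FastAttention module; the tensor shapes and the redraw-by-reinstantiation trick are my assumptions, not a prescribed benchmark):

```python
import torch
from performer_pytorch import FastAttention

# redraw the random projection several times (by constructing a fresh module)
# and measure how much the output fluctuates for a fixed input
def output_std(nb_features, dim_head = 64, trials = 10):
    q = torch.randn(1, 1, 128, dim_head)  # (batch, heads, seq, dim_head)
    k, v = torch.randn_like(q), torch.randn_like(q)
    outs = []
    with torch.no_grad():
        for _ in range(trials):
            attn = FastAttention(dim_heads = dim_head, nb_features = nb_features)
            outs.append(attn(q, k, v))
    return torch.stack(outs).std(dim = 0).mean().item()

for m in (16, 64, 256):
    print(m, output_std(m))  # spread across redraws should shrink as nb_features grows
```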

lucidrains commented 3 years ago

@Parskatt Let us know what you find in your experiments!

Parskatt commented 3 years ago

I will continue my experiments and let you know later :)