lucidrains / performer-pytorch

An implementation of Performer, a linear attention-based transformer, in Pytorch
MIT License

hyperbolic cosine based estimator #73

Open gaganbahga opened 3 years ago

gaganbahga commented 3 years ago

Hi @lucidrains, thanks a lot for providing this implementation. The authors of the paper propose two estimators based on Positive Random Features: one built from exponential functions (referred to as SM+), and another built from hyperbolic cosine (referred to as SMhyp+). As far as I can tell, the jax/TF implementation, and therefore this repository as well, only implements the exponential one. In the paper, the authors state: "Furthermore, the hyperbolic estimator provides additional accuracy improvements that are strictly better than those from SM+_2m(x, y) with twice as many random features." So it seems the cosh-based estimator should have been the default choice, yet it is not. Would you happen to have more insight into this?

Also, does the ortho_scaling=1 option switch on the regularized softmax kernel (SMREG)? Is it recommended anywhere? The authors do mention ortho_scaling = 0.0 as the default hyperparameter choice.
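For reference, here is a minimal NumPy sketch (not this repository's API) of how I understand the two estimators of the softmax kernel exp(x·y). The SM+ feature map is exp(w·x - ||x||²/2)/√m; the SMhyp+ map concatenates exp(w·x) and exp(-w·x), giving 2m cosh-style features from m Gaussian draws. Function names and the toy dimensions are my own, purely for illustration:

```python
import numpy as np

def softmax_kernel_exp(x, y, w):
    # SM+ estimator: phi(v) = exp(w @ v - ||v||^2 / 2) / sqrt(m)
    # Unbiased for exp(x . y) when rows of w are drawn from N(0, I_d).
    m = w.shape[0]
    phi = lambda v: np.exp(w @ v - (v @ v) / 2) / np.sqrt(m)
    return phi(x) @ phi(y)

def softmax_kernel_cosh(x, y, w):
    # SMhyp+ estimator: phi(v) stacks exp(w @ v) and exp(-w @ v),
    # i.e. a cosh-based feature map with 2m features from m draws.
    m = w.shape[0]
    def phi(v):
        u = w @ v
        return np.exp(-(v @ v) / 2) * np.concatenate([np.exp(u), np.exp(-u)]) / np.sqrt(2 * m)
    return phi(x) @ phi(y)

# Compare both against the exact softmax kernel value on toy vectors.
rng = np.random.default_rng(0)
d, m = 8, 4096
x = rng.normal(size=d) * 0.2
y = rng.normal(size=d) * 0.2
w = rng.normal(size=(m, d))  # i.i.d. Gaussian projections (no orthogonalization here)
exact = np.exp(x @ y)
est_exp = softmax_kernel_exp(x, y, w)
est_cosh = softmax_kernel_cosh(x, y, w)
```

Both estimators are unbiased, since E[exp(w·(x+y))] = exp(||x+y||²/2) for Gaussian w; the claim in the paper is about the cosh variant's lower variance at matched feature count.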