giannisdaras / smyrf

[NeurIPS 2020] Official Implementation: "SMYRF: Efficient Attention using Asymmetric Clustering".
GNU General Public License v3.0

Auto-regressive #4

Closed. lucidrains closed this issue 3 years ago.

lucidrains commented 3 years ago

Hi Giannis!

Thanks for the great paper! I am interested in your asymmetric LSH, as I think having separate query / key spaces (as opposed to the shared QK space in Reformer) will bring performance improvements to LSH-based attention.

I saw that you recommended that a previous user use this form of clustering for the auto-regressive case, and I just wanted to ask whether you had considered the scenario where a bucket of queries does not get matched with any keys from the past at all. This was an issue I ran into when trying to make separate QK spaces work with Routing Transformer, and I was wondering whether you had identified this problem and found a solution to it.
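To make the failure mode concrete, here is a tiny made-up illustration (not from either codebase): if, within one cluster, every key lands at a later position than every query, the causal mask leaves nothing to attend to and the softmax blows up.

```python
import torch

# Toy example: in this cluster every key sits at a later position than
# every query, so the causal mask removes all keys and the row-wise
# softmax has empty support.
q_pos = torch.tensor([3, 5])        # positions of the queries in this cluster
k_pos = torch.tensor([8, 11, 14])   # positions of the keys in the same cluster

scores = torch.randn(len(q_pos), len(k_pos))          # dummy q @ k^T scores
causal = k_pos[None, :] <= q_pos[:, None]             # keys must not come from the future
scores = scores.masked_fill(~causal, float('-inf'))   # standard causal masking

print(scores.softmax(dim=-1))   # every row is NaN: no key survived the mask
```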

Phil

giannisdaras commented 3 years ago

Hi Phil! Thanks a lot for your interest and for all your contributions to the open-source community; I have used many of your implementations :)

I think that for the autoregressive case, we could always allow a token to attend to itself to avoid this issue. Another thing to try is to use local attention together with asymmetric LSH. My intuition is that if a token is matched only with tokens from the future across all hashing rounds, then the local attention output will get a higher weight in the final softmax (the one we use to merge the different hashing rounds).
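Roughly, I picture the merge working like this (just a sketch of my intuition with made-up names, not code from the repo): each hashing round, plus the local-attention pass treated as one extra round, returns its output together with its log-sum-exp, and the rounds are combined with a softmax over those log-sum-exps, so a round in which a query only matched future (masked) keys contributes almost nothing.

```python
import torch

def merge_rounds(outputs, logsumexps):
    """Combine per-round attention outputs, weighting each round by the
    softmax of its log-sum-exp. A round whose keys were all masked out
    carries a very small log-sum-exp and is effectively ignored.

    outputs:    list of [batch, seq, dim] tensors, one per hashing round
                (the local-attention output can be appended as an extra round).
    logsumexps: list of [batch, seq] tensors, the log of each round's
                softmax denominator.
    """
    out = torch.stack(outputs, dim=0)           # [rounds, batch, seq, dim]
    lse = torch.stack(logsumexps, dim=0)        # [rounds, batch, seq]
    weights = lse.softmax(dim=0).unsqueeze(-1)  # normalize across rounds
    return (weights * out).sum(dim=0)           # [batch, seq, dim]
```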

Looking forward to hearing your thoughts on that.

Giannis

lucidrains commented 3 years ago

Thank you for the kind words! I'm glad the repositories were helpful!

Another researcher and I tried solving the problem as you suggested, by appending noop memory tokens, but results still seem to be much worse than shared QK in the auto-regressive case. Local attention as a fallback is a good suggestion, but I fear the earlier tokens in the sequence would not get the attention they deserve. Do let me know if you ever come across a solution to this!
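For reference, the noop memory tokens were roughly along these lines (a simplified sketch with made-up names, not the exact code we used): a few learned key/value pairs are prepended so that every query always has at least one unmasked key to attend to.

```python
import torch
import torch.nn as nn

class NoopMemory(nn.Module):
    """Learned 'no-op' key/value pairs that every query may attend to,
    guaranteeing a non-empty softmax even when the causal mask removes
    all real keys in a bucket. Purely illustrative; names are made up."""

    def __init__(self, dim, num_mem=1):
        super().__init__()
        self.mem_k = nn.Parameter(torch.randn(num_mem, dim) * 0.02)
        self.mem_v = nn.Parameter(torch.zeros(num_mem, dim))  # no-op values contribute ~nothing

    def forward(self, k, v, mask):
        # k, v: [batch, n, dim]; mask: [batch, n] with True = attendable
        b, num_mem = k.shape[0], self.mem_k.shape[0]
        mem_k = self.mem_k.expand(b, -1, -1)
        mem_v = self.mem_v.expand(b, -1, -1)
        mem_mask = torch.ones(b, num_mem, dtype=torch.bool, device=k.device)
        return (torch.cat((mem_k, k), dim=1),
                torch.cat((mem_v, v), dim=1),
                torch.cat((mem_mask, mask), dim=1))
```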

Otherwise, asymmetric LSH seems promising for the non-autoregressive case! https://arxiv.org/abs/2011.09315

I will play with E2LSH a bit more and then see if I can improve on Reformer by enhancing the encoder and cross attention with separate query / key spaces.
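For anyone reading along, by E2LSH I mean the standard p-stable hashing scheme, h(x) = floor((a·x + b) / r) with Gaussian projections a and uniform offsets b. A generic sketch (not the SMYRF implementation) looks like this:

```python
import torch

def e2lsh_hash(x, n_hashes=8, r=1.0):
    """Standard E2LSH (p-stable) hashing: h(x) = floor((a.x + b) / r).
    x: [..., dim]; returns integer hash codes of shape [..., n_hashes].
    In practice the projections would be sampled once and stored, not
    re-drawn on every call; this is only a sketch."""
    dim = x.shape[-1]
    a = torch.randn(dim, n_hashes, device=x.device)   # Gaussian projections
    b = torch.rand(n_hashes, device=x.device) * r     # uniform offsets in [0, r)
    return torch.floor((x @ a + b) / r).long()
```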

Thank you for the great paper! Phil