Closed: lucidrains closed this issue 3 years ago
Hi Phil! Thanks a lot for your interest and for all your contributions to the open-source community; I have used many of your implementations :)
I think that for the autoregressive case, we could always allow a token to attend to itself to avoid this issue. Another thing to try is to use local attention together with asymmetric LSH. My intuition is that if a token is matched only with tokens from the future across all hashing rounds, then the local attention output will receive a higher weight in the final softmax (the one we use to merge the different hashing rounds).
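A minimal sketch of the first suggestion, assuming per-bucket causal attention in PyTorch: each query's own key/value is appended as a guaranteed fallback target, so the causal mask can never leave a softmax row empty. The helper name and signature are hypothetical, not taken from the SMYRF codebase.

```python
import torch
import torch.nn.functional as F

def bucket_attention_with_self(q, k, v, q_pos, k_pos, self_k, self_v):
    """Causal attention within one LSH bucket, with each query's own
    key/value appended so the softmax row is never empty, even when the
    bucket holds no keys from the past.

    q:              (n_q, d)  queries routed to this bucket
    k, v:           (n_k, d)  keys / values routed to this bucket
    q_pos, k_pos:   sequence positions of the queries / keys above
    self_k, self_v: (n_q, d)  each query token's own key / value
    """
    d = q.shape[-1]
    # scores against the bucket's keys, masked to past-or-present positions
    scores = q @ k.t() / d ** 0.5                                   # (n_q, n_k)
    scores = scores.masked_fill(k_pos[None, :] > q_pos[:, None], float('-inf'))
    # score of each token against itself: a guaranteed-valid fallback target
    self_score = (q * self_k).sum(-1, keepdim=True) / d ** 0.5      # (n_q, 1)
    attn = F.softmax(torch.cat([scores, self_score], dim=-1), dim=-1)
    # weighted sum over bucket values plus each token's own value
    return attn[:, :-1] @ v + attn[:, -1:] * self_v
```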
Looking forward to hearing your thoughts on that.
Giannis
Thank you for the kind words! I'm glad the repositories were helpful!
Another researcher and I tried solving the problem as you suggested, by appending noop memory tokens, but results still seem to be much worse than shared QK in the autoregressive case. Local attention as a fallback is a good suggestion, but I worry that the earliest tokens in the sequence would not get the attention they deserve. Do let me know if you ever come across a solution to this!
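For reference, a minimal sketch of the noop-memory-token idea (the names and shapes are mine, not the exact code we experimented with): a learned null key/value pair is prepended and is always attendable, so a query whose mask excludes every real key still has a valid target.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionWithNullKV(nn.Module):
    """Attention with a learned 'noop' key/value prepended, so a query
    whose causal mask removes every real key still has a valid target."""

    def __init__(self, dim):
        super().__init__()
        self.null_k = nn.Parameter(torch.randn(1, dim))
        self.null_v = nn.Parameter(torch.zeros(1, dim))  # noop: contributes ~nothing

    def forward(self, q, k, v, mask):
        # q: (n, d), k/v: (m, d), mask: (n, m) True where attention is allowed
        k = torch.cat([self.null_k, k], dim=0)
        v = torch.cat([self.null_v, v], dim=0)
        # the null position is always attendable
        null_col = torch.ones(q.shape[0], 1, dtype=torch.bool, device=mask.device)
        mask = torch.cat([null_col, mask], dim=1)
        scores = q @ k.t() / q.shape[-1] ** 0.5
        scores = scores.masked_fill(~mask, float('-inf'))
        return F.softmax(scores, dim=-1) @ v
```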
Otherwise, asymmetric LSH seems promising for the non-autoregressive case! https://arxiv.org/abs/2011.09315
I will play with E2LSH a bit more and then see if I can improve on Reformer by enhancing the encoders and cross-attention with separate query / key spaces.
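For anyone following along, E2LSH here refers to the standard p-stable scheme of Datar et al.: h(x) = floor((a . x + b) / w) with a drawn from a Gaussian and b uniform in [0, w), so points close in Euclidean distance collide with high probability. A small illustrative sketch (parameter names are mine, not SMYRF's):

```python
import torch

def e2lsh_hashes(x, n_hashes=4, bucket_width=1.0, seed=0):
    """E2LSH: h(x) = floor((a . x + b) / w), a ~ N(0, I), b ~ Uniform(0, w).

    x: (n, d) points to hash -> (n, n_hashes) integer bucket ids
    """
    g = torch.Generator().manual_seed(seed)
    d = x.shape[-1]
    a = torch.randn(d, n_hashes, generator=g)           # random Gaussian projections
    b = torch.rand(n_hashes, generator=g) * bucket_width  # random offsets in [0, w)
    return torch.floor((x @ a + b) / bucket_width).long()
```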
Thank you for the great paper! Phil
Hi Giannis!
Thanks for the great paper! I am interested in your asymmetric LSH, as I think having separate query / key spaces (as opposed to the shared QK in Reformer) will bring performance improvements in LSH-based attention.
I saw that you recommended this form of clustering to a previous user for the autoregressive case, and just wanted to ask whether you had considered the scenario where a bucket of queries does not get matched with any keys from the past at all. This was an issue I ran into when trying to make a separate QK space work with Routing Transformer, but I am just wondering whether you had identified this problem and found a solution to it.
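To make the failure mode concrete, a toy illustration (not from either codebase): if LSH routes a query into a bucket whose keys all lie in the future, the causal mask blanks out every score and the softmax row degenerates to NaN.

```python
import torch
import torch.nn.functional as F

q_pos = torch.tensor([0, 1])               # query positions in the bucket
k_pos = torch.tensor([5, 6])               # key positions in the same bucket
scores = torch.randn(2, 2)                 # arbitrary q.k scores
mask = k_pos[None, :] <= q_pos[:, None]    # past-or-present keys only: all False
scores = scores.masked_fill(~mask, float('-inf'))
print(F.softmax(scores, dim=-1))           # all NaN: no valid key to attend to
```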
Phil