Hello @angeloskath and fast-transformers team.
I was testing the version on master with local attention, and apparently there is a bug when using a mask: it always gives NaN values if I pass a length mask. Other attention types such as full or linear work fine, and if I do not use a length_mask, local attention also works.
I attached a small code snippet where you can reproduce the error.
bug_local.zip
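For context, a minimal sketch of the kind of reproduction in the attachment could look like the following (this is not the exact script in bug_local.zip; the model sizes, `local_context` value, and sequence lengths here are just assumed values):

```python
import torch
from fast_transformers.builders import TransformerEncoderBuilder
from fast_transformers.masking import LengthMask


def build(attention_type):
    # Build an encoder with the given attention type; local_context is an
    # assumed value and is only used by the local attention.
    return TransformerEncoderBuilder.from_kwargs(
        n_layers=2,
        n_heads=4,
        query_dimensions=32,
        value_dimensions=32,
        feed_forward_dimensions=128,
        attention_type=attention_type,
        local_context=16,
    ).get()


N, L, E = 2, 100, 4 * 32          # batch size, sequence length, model size
x = torch.rand(N, L, E)

# Length mask where the second sequence has padding after position 50.
lengths = torch.tensor([100, 50])
length_mask = LengthMask(lengths, max_len=L)

for attention_type in ["linear", "local"]:
    model = build(attention_type)
    y = model(x, length_mask=length_mask)
    print(attention_type, "contains NaN:", torch.isnan(y).any().item())
```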
Here is the output for linear and local attention. I'm using Python 3.8 and PyTorch 1.6, without CUDA.