Open DanTaranis opened 1 year ago
fyi - I did a quick PoC with CIFAR-10 + a small ViT trained with and without kWTA (90% sparsity), and kWTA actually worked a bit like a regularizer (slightly higher max validation accuracy + slower convergence).
So it looks like this definitely has potential. My team and I may look further into this if you want to collaborate on a paper or something.
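For reference, here is a minimal sketch of the kWTA activation described above (a hypothetical NumPy implementation, not the exact code used in the PoC): keep the top-k activations per row and zero out the rest, where 90% sparsity means only the top 10% of units survive.

```python
import numpy as np

def kwta(x, sparsity=0.9):
    """k-winners-take-all: keep the top-k activations along the last
    axis, zero the rest. With sparsity=0.9, only 10% of units survive."""
    k = max(1, int(round(x.shape[-1] * (1.0 - sparsity))))
    # indices of the k largest values along the last axis
    idx = np.argpartition(x, -k, axis=-1)[..., -k:]
    mask = np.zeros(x.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return np.where(mask, x, 0.0)

x = np.array([[0.1, 0.9, -0.3, 0.5, 0.2, 0.05, 0.7, -0.1, 0.3, 0.4]])
out = kwta(x, sparsity=0.9)  # only the single largest value (0.9) survives
```

In a ViT this would typically be applied to the MLP or attention activations during both training and inference.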
Hey - first of all, thank you for your inspiring research.
There's a lot of work around making self-attention efficient, especially as sequence length increases. It seems to me that, under the kWTA assumption, you could skip the vast majority of the calculations due to the inherent extreme sparsity.
And the best part is that it would be complementary to many of the linear-complexity attention methods that are coming out.
Are you experimenting with something like that?
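To make the "skip most of the calculations" idea concrete, here is a hypothetical sketch (my own illustration, not from the paper): if kWTA has already zeroed out most key positions, the score matrix only needs to be computed against the surviving subset, reducing the per-query cost from O(n) to O(m) with m << n.

```python
import numpy as np

def sparse_attention(q, k, v, key_mask):
    """Attention restricted to 'active' key positions (key_mask True).

    Illustrative only: assumes a prior kWTA step has marked most keys
    inactive, so we gather the m surviving keys/values and attend over
    those instead of the full length-n sequence.
    """
    active = np.nonzero(key_mask)[0]
    k_s, v_s = k[active], v[active]            # (m, d) with m << n
    scores = q @ k_s.T / np.sqrt(q.shape[-1])  # (nq, m) instead of (nq, n)
    # numerically stable softmax over the surviving keys only
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v_s
```

The result matches dense attention with the inactive positions masked to -inf, which is what makes it composable with the linear-attention methods mentioned above.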
Regards, Dan