idiap / fast-transformers

PyTorch library for fast transformer implementations

Question about the CUDA implementation of the causal product (forward) #91

Closed · thomasw21 closed this issue 3 years ago

thomasw21 commented 3 years ago

Hi!

Let me start off by saying this is incredible work; both the code and your research have been very useful to me!

I'm currently looking into the CUDA implementation (the non-NVIDIA-optimized version). I've just started reading CUDA code, so my questions might be stupid:
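For reference, here is a small pure-PyTorch sketch of what I understand the causal product forward pass to compute (the function name and shapes are my own, just for illustration):

```python
import torch

def causal_dot_product_reference(Q, K, V):
    """Pure-PyTorch sketch of the causal product forward pass as I understand it.

    Q, K: (N, H, L, E), V: (N, H, L, M)
    out[n, h, i, :] = sum_{j <= i} (Q[n, h, i] . K[n, h, j]) * V[n, h, j]
    computed by keeping a running KV state of shape (E, M) per sample and head.
    """
    N, H, L, E = Q.shape
    M = V.shape[-1]
    out = torch.zeros(N, H, L, M, dtype=Q.dtype, device=Q.device)
    KV = torch.zeros(N, H, E, M, dtype=Q.dtype, device=Q.device)
    for i in range(L):
        # accumulate the outer product K_i^T V_i into the running KV state
        KV += torch.einsum("nhe,nhm->nhem", K[:, :, i], V[:, :, i])
        # contract the query with the accumulated state to get the i-th output
        out[:, :, i] = torch.einsum("nhe,nhem->nhm", Q[:, :, i], KV)
    return out
```

If I read it correctly, it is this running KV state that keeps the forward pass linear in the sequence length.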

angeloskath commented 3 years ago

Hi Thomas,

Thanks for your kind words, and sorry for the late reply.

Your questions are far from stupid. @qibinc made a PR that improved the kernel for large query and value dimensions, since the small dimensions are now handled by the NVIDIA improvements (by @jdemouth-nvidia). I accepted the PR without looking into it much, since the tests were passing and it was indeed faster. Having said that, I did update the kernel after your observations; it now makes more sense and is also slightly faster. The speed improvement was not just due to saving the KV but also due to increasing the E blocks a bit. So without further ado:
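To give a rough picture of what I mean by the KV state and the E blocks, here is a conceptual PyTorch sketch (not the actual kernel; the block size and names are just for illustration). The feature dimension E is split into blocks, each block keeps its own slice of the running KV state, and the partial products over the blocks are summed into the output:

```python
import torch

def causal_dot_product_blocked(Q, K, V, e_block=32):
    """Conceptual sketch only: tile the feature dimension E into blocks.

    Each block maintains its own slice of the running KV state and adds a
    partial sum into the output. In the kernel these blocks map to parallel
    work; here they are just an outer Python loop.
    """
    N, H, L, E = Q.shape
    M = V.shape[-1]
    out = torch.zeros(N, H, L, M, dtype=Q.dtype, device=Q.device)
    for e_start in range(0, E, e_block):
        e_end = min(e_start + e_block, E)
        # running KV state restricted to this block of E
        KV = torch.zeros(N, H, e_end - e_start, M, dtype=Q.dtype, device=Q.device)
        for i in range(L):
            KV += torch.einsum("nhe,nhm->nhem", K[:, :, i, e_start:e_end], V[:, :, i])
            # partial contribution of this E block to the i-th output
            out[:, :, i] += torch.einsum("nhe,nhem->nhm", Q[:, :, i, e_start:e_end], KV)
    return out
```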

Let me know if anything I wrote is unclear or if you have more questions.

Cheers, Angelos