Closed: thomasw21 closed this issue 3 years ago
Hi Thomas,
Thanks for your good words and sorry for the late reply.
Your questions are far from stupid. @qibinc made a PR that improved the kernel for large query and value dimensions, because the small dimensions were by then handled by the NVIDIA improvements (by @jdemouth-nvidia). I accepted the PR without looking into it much, since the tests were passing and it was indeed faster. Having said that, after your observations I did update the kernel; it now makes more sense and is also slightly faster. The speed improvement was not just due to saving the KV but also due to increasing the E blocks a bit. So without further ado: the shared memory used per block is `E_BLOCK_SIZE * M * sizeof(float)`, which should be less than 48k, so with `E_BLOCK_SIZE = 4` that gives `M <= 3000`. I increased it to 8, and now the limits of the kernel are ~1500 for both `M` and `E`.

Let me know if anything I wrote is unclear or if you have more questions.
Cheers, Angelos
Hi!
Let me start off by saying this is incredible work; both the code and your research have been very useful to me!
I'm currently looking into the CUDA implementations (the non-NVIDIA-optimized version). I've just started reading CUDA code, so my questions might be stupid:

- Couldn't each thread have computed `res` and added it to `result` directly? To my understanding, only threads within the same block have access to the same shared memory.
- What is the purpose of the `E`-blocks, and how did you come to put 4 as its value? Unless I'm wrong, you could create E x N x H blocks instead (i.e. single-element blocks).