Closed: thomasw21 closed this issue 3 years ago
Hi Thomas,
Thanks for your good words and sorry for the late reply.
Your questions are far from stupid. @qibinc made a PR that improved the kernel for large query and value dimensions, because the small dimensions were by then handled by the NVIDIA improvements (by @jdemouth-nvidia). I accepted the PR without looking into it much, since the tests were passing and it was indeed faster. Having said that, after your observations I did update the kernel; it now makes more sense and is also slightly faster. The speed improvement was not just due to saving the KV but also due to increasing the E blocks a bit. So without further ado: the shared memory used per block is `E_BLOCK_SIZE * M * sizeof(float)`, which should be less than 48k, so with `E_BLOCK_SIZE = 4` that gives `M <= 3000`. I increased it to 8, and now the limits of the kernel are ~1500 for both `M` and `E`.

Let me know if anything I wrote is unclear or if you have more questions.
Cheers, Angelos
Hi!
Let me start off by saying this is incredible work; both the code and your research have been very useful to me!
I'm currently looking into the CUDA implementations (the non-NVIDIA-optimized version). I've just started reading CUDA code, so my questions might be stupid:

- Couldn't each thread have computed `res` and added it to `result` directly? To my understanding, only threads within the same block have access to the same shared memory.
- What is the purpose of the `E`-blocks, and how did you come to put 4 as its value? Unless I'm wrong, you could create E x N x H blocks instead (i.e. single-element blocks).