idiap / fast-transformers

PyTorch library for fast transformer implementations

Local Product CUDA Kernel #51

Closed AndriyMulyar closed 3 years ago

AndriyMulyar commented 4 years ago

Nice library. I have a question regarding the local product (Longformer-style sliding window) kernel you have implemented. If I am correctly interpreting the implementation here, the QK^T operation is decomposed into blocks of size 64 along the num_queries dimension, which are then multiplied via the cuBLAS GEMM implementation against a window of 64 ± local_context/2 keys. The local_context window for each query is then copied out with a custom copy kernel.
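For concreteness, here is a (slow) dense PyTorch sketch of what I understand the local product to compute; the exact window alignment is my guess, not necessarily the kernel's:

```python
import torch

def naive_local_product(Q, K, local_context):
    # Q, K: (N, L, E). For each query i, take dot products with the keys in a
    # local_context-sized window centred on i (assumed alignment, see above).
    N, L, E = Q.shape
    half = local_context // 2
    out = torch.full((N, L, local_context), float("-inf"),
                     device=Q.device, dtype=Q.dtype)
    for i in range(L):
        start = max(0, i - half)
        end = min(L, i - half + local_context)
        out[:, i, : end - start] = torch.einsum(
            "ne,nke->nk", Q[:, i], K[:, start:end]
        )
    return out
```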

With this implementation, the dot products for a much larger context window than local_context are computed but subsequently ignored. Since these computations already happen, is it true that setting local_context to any value in [2, 64] would essentially not alter the latency of the implementation but likely improve the generalization ability of the end transformer (due to the larger context window of each layer)?
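To sanity-check the latency part, one could time just the wide batched GEMM that (on my reading) dominates the kernel; a rough sketch, with shapes taken from my reading above rather than from the actual implementation:

```python
import time
import torch

def time_blocked_gemm(local_context, L=4096, E=64, iters=100):
    # Batched GEMM of one (64 x E) query block against an
    # (E x (local_context + 64)) key slab per block -- assumed shapes.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    blocks = L // 64
    Qb = torch.randn(blocks, 64, E, device=device)
    Kb = torch.randn(blocks, E, local_context + 64, device=device)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.bmm(Qb, Kb)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

for c in (2, 16, 64):
    print(c, time_blocked_gemm(c))
```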

Thanks!

angeloskath commented 4 years ago

Hi Andriy,

Sorry for the belated reply. What you wrote is probably correct, but I will first answer your questions and then clarify the code a bit to make sure we are on the same page.

  1. "With this implementation, the dot products for a much larger context window than local_context are computed": Not much larger. Instead of performing 64×local_context dot products per block we perform 64×(local_context+64), so all in all we perform precisely 4,096 extra dot products per block of 64 queries (see the short count below this list). However, that does not mean it costs more than not computing them: GPUs are SIMD machines, so chances are we would waste some cycles anyway (due to if checks, etc.).
  2. "Since these computations already happen, is it true that setting local_context to any value in [2, 64] would essentially not alter the latency": Probably, but I cannot be sure. It depends on both the cuBLAS implementation and our copy kernel. However, very small local contexts would definitely be sub-optimal.
  3. "but likely improve the generalization ability of the end transformer": Most likely; the context window is a hyper-parameter to be tuned anyway.
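To make the counting in point 1 concrete, per block of 64 queries (nothing here is specific to the kernel, it is just the arithmetic above):

```python
local_context = 32                     # any window size in [2, 64]
window_only = 64 * local_context       # the dot products we actually need
computed = 64 * (local_context + 64)   # what the single wide GEMM computes
print(computed - window_only)          # 4096 extra, whatever local_context is
```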

Implementation description

The local product is implemented as follows:
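Roughly, in plain (non-CUDA) PyTorch, and under the assumptions already discussed in this thread (block size 64, a window centred on each query; the real kernel also deals with masking and per-sequence lengths):

```python
import torch

def blocked_local_product(Q, K, local_context, block=64):
    # Q, K: (N, L, E). Process queries in blocks of `block`; for every block,
    # one wide GEMM against all keys the block can touch (cuBLAS's job on the
    # GPU), then copy out each query's own window (the custom copy kernel).
    N, L, E = Q.shape
    half = local_context // 2
    out = torch.full((N, L, local_context), float("-inf"),
                     device=Q.device, dtype=Q.dtype)
    for b in range(0, L, block):
        start = max(0, b - half)          # first key any query in the block needs
        end = min(L, b + block + half)    # one past the last key it needs
        scores = torch.einsum(
            "nqe,nke->nqk", Q[:, b:b + block], K[:, start:end]
        )
        for i in range(b, min(b + block, L)):
            s = max(0, i - half)
            e = min(L, i - half + local_context)
            out[:, i, : e - s] = scores[:, i - b, s - start:e - start]
    return out
```

On equal inputs this should agree with the dense reference sketch in the question above.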

As you can see, we have at most an extra 64 dot products per query, so at most 4,096 extra dot products per block of 64 queries. Also, the block size 64 is a parameter, so if needed we could add a dynamic dispatch for smaller context windows in the future (although I don't think it would help, because the GPU would not have enough work to run efficiently).

Let me know if this helps.

Cheers, Angelos