idiap / fast-transformers

Pytorch library for fast transformer implementations

CUDA version and CausalDotProduct time #83

Closed caffeinetoomuch closed 3 years ago

caffeinetoomuch commented 3 years ago

Currently, I am using the performer repo, which uses this repo's CausalDotProduct.

However, while benchmarking the model, we found that the Performer attention was taking more CUDA time than the original attention. We profiled the forward call and found that causal_dot_product_kernel was the main bottleneck. I am aware that the recent merge #77 made causal_dot_product_kernel faster. I ran the model on Ubuntu 18.04 with CUDA 11.2 and pytorch 1.8.1+cu111 installed. Should the CUDA version affect the performance of causal_dot_product_kernel?
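For context, here is a minimal sketch of the kind of profiling that surfaces this (the encoder below is just a stand-in built with this repo's builders, not our actual fine-tuned model):

import torch
from torch.profiler import profile, ProfilerActivity
from fast_transformers.builders import TransformerEncoderBuilder
from fast_transformers.masking import TriangularCausalMask

# Stand-in model: a small causal-linear encoder that goes through CausalDotProduct.
model = TransformerEncoderBuilder.from_kwargs(
    n_layers=2,
    n_heads=16,
    query_dimensions=64,
    attention_type="causal-linear"
).get().cuda()
x = torch.randn(8, 1024, 1024, device="cuda")
mask = TriangularCausalMask(1024, device="cuda")

with torch.no_grad():
    model(x, attn_mask=mask)  # warm-up so CUDA initialisation does not pollute the profile
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        model(x, attn_mask=mask)

# Sort kernels by total CUDA time to see where the forward pass is spent.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))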

Thanks for the great repo!

angeloskath commented 3 years ago

Hi,

Thanks for liking the repo.

CUDA version should not greatly affect the performance.

Could you share the sequence length, query/value dimensions, and the number of random features that you are using? For small sequences, linear attention is only about as fast as softmax (or even slower).
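As a rough rule of thumb, softmax attention costs on the order of N^2 * d operations per head while causal linear attention costs on the order of N * d * m (with m the feature-map dimension), so the gap only opens up once N is comfortably larger than m; a back-of-the-envelope sketch, ignoring constants and memory traffic (which push the real break-even point higher):

# Rough per-head cost estimates (multiply-adds, constants ignored).
# N: sequence length, d: head dimension, m: feature-map dimension.
def softmax_attention_cost(N, d):
    return N * N * d          # Q K^T plus the attention-weighted sum over V

def linear_attention_cost(N, d, m):
    return N * m * d          # running K^T V summaries plus their products with Q

d = m = 64
for N in (128, 512, 1024, 4096):
    print(N, softmax_attention_cost(N, d), linear_attention_cost(N, d, m))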

Cheers, Angelos

caffeinetoomuch commented 3 years ago

For profiling, I used a sequence length of 1024 (padded on the right) and the model was t5-large, so the dimension was 1024 for both query and value (a fine-tuned summarization model), and I passed None for the random features. Linear attention was taking twice as long as softmax for a single forward call. Will linear attention be faster with a larger sequence length?

Thanks!

angeloskath commented 3 years ago

So if I understand correctly, d_model is 1024 and you are using 16 heads. The following script measures the forward-pass time of linear attention, softmax attention, and linear attention with Favor for a 4-layer transformer like this:

import torch
from fast_transformers.builders import TransformerEncoderBuilder
from fast_transformers.masking import TriangularCausalMask
from fast_transformers.feature_maps import Favor

t1 = TransformerEncoderBuilder.from_kwargs(
    n_layers=4,
    n_heads=16,
    query_dimensions=64,
    attention_type="causal-linear"
).get().cuda()
t2 = TransformerEncoderBuilder.from_kwargs(
    n_layers=4,
    n_heads=16,
    query_dimensions=64,
    attention_type="full"
).get().cuda()
t3 = TransformerEncoderBuilder.from_kwargs(
    n_layers=4,
    n_heads=16,
    query_dimensions=64,
    attention_type="causal-linear",
    feature_map=Favor.factory()
).get().cuda()
x1 = torch.randn(24, 1024, 1024).cuda()
x2 = torch.randn(24, 1024, 1024).cuda()
x3 = torch.randn(24, 1024, 1024).cuda()
attn_mask = TriangularCausalMask(1024, device="cuda")

def cuda_time(t, x, m):
    # warmup
    t(x, attn_mask=m)
    t(x, attn_mask=m)

    # measure
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    t(x, attn_mask=m)
    end.record()
    torch.cuda.synchronize()

    return start.elapsed_time(end)

with torch.no_grad():
    print("Linear", cuda_time(t1, x1, attn_mask))
    print("Full", cuda_time(t2, x2, attn_mask))
    print("Favor", cuda_time(t3, x3, attn_mask))

On my RTX 2060S, using the code from master, I get the following results:

Linear 247.43116760253906
Full 427.42333984375
Favor 285.3621826171875

Obviously we can use a larger batch size with linear attention, so the per-sample difference could be even larger (see the quick per-sample sketch below). Let me know what is different in your configuration.
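For example, dividing the timings by the batch size gives a rough per-sample figure; a couple of lines like these could be appended to the script above (still forward-pass only, reusing cuda_time and the models defined there):

# Rough per-sample forward time in ms for the same batch size of 24.
batch_size = 24
with torch.no_grad():
    for name, model, x in (("Linear", t1, x1), ("Full", t2, x2), ("Favor", t3, x3)):
        print(name, cuda_time(model, x, attn_mask) / batch_size, "ms / sample")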

Cheers, Angelos

angeloskath commented 3 years ago

Hi,

I assume that the problem was something else and this is now solved. If it is not the case, feel free to reopen the issue.

Best, Angelos