caffeinetoomuch closed this issue 3 years ago
Hi,
Thanks for liking the repo.
The CUDA version should not greatly affect the performance.
Could you share the sequence length, the query/value dimensions and the number of random features that you are using? For small sequences, linear attention is only about as fast as softmax attention, or even slower.
Cheers, Angelos
For profiling I used a sequence length of 1024 (padded on the right) and the model was t5-large, so the dimension was 1024 for both query and value (a fine-tuned summarization model), and I used None for the random features. Linear attention was taking twice as long as softmax for a single forward call. Will linear attention be faster with a larger sequence length?
Thanks!
So if I understand correctly, d_model is 1024 and you are using 16 heads. The following script measures the forward-pass time for linear attention, softmax attention and linear attention with Favor for a 4-layer transformer configured like this:
import torch
from fast_transformers.builders import TransformerEncoderBuilder
from fast_transformers.masking import TriangularCausalMask
from fast_transformers.feature_maps import Favor

# d_model = n_heads * query_dimensions = 16 * 64 = 1024
t1 = TransformerEncoderBuilder.from_kwargs(n_layers=4, n_heads=16, query_dimensions=64, attention_type="causal-linear").get().cuda()
t2 = TransformerEncoderBuilder.from_kwargs(n_layers=4, n_heads=16, query_dimensions=64, attention_type="full").get().cuda()
t3 = TransformerEncoderBuilder.from_kwargs(n_layers=4, n_heads=16, query_dimensions=64, attention_type="causal-linear", feature_map=Favor.factory()).get().cuda()

# batch of 24 sequences of length 1024 with d_model 1024
x1 = torch.randn(24, 1024, 1024).cuda()
x2 = torch.randn(24, 1024, 1024).cuda()
x3 = torch.randn(24, 1024, 1024).cuda()
attn_mask = TriangularCausalMask(1024, device="cuda")

def cuda_time(t, x, m):
    # warmup
    t(x, attn_mask=m)
    t(x, attn_mask=m)
    # measure a single forward pass with CUDA events (elapsed_time is in milliseconds)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    t(x, attn_mask=m)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)

with torch.no_grad():
    print("Linear", cuda_time(t1, x1, attn_mask))
    print("Full", cuda_time(t2, x2, attn_mask))
    print("Favor", cuda_time(t3, x3, attn_mask))
On my RTX 2060S, using the code from master, I get the following results (times in milliseconds):
Linear 247.43116760253906
Full 427.42333984375
Favor 285.3621826171875
Obviously we can use a larger batch size with linear attention, so the per-sample difference could be even larger (see the peak-memory sketch below). Let me know what is different in your configuration.
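To make the batch-size point concrete, a rough sketch like the following compares the peak forward-pass memory of the two models. It reuses t1, t2, the inputs and attn_mask from the script above; peak_memory_mb is just a throwaway helper, and the exact numbers will depend on the GPU:

def peak_memory_mb(model, x, m):
    # Reset the peak-memory counter, run one forward pass and report
    # the maximum memory allocated during that pass, in MiB.
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model(x, attn_mask=m)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024**2

print("Linear peak MiB", peak_memory_mb(t1, x1, attn_mask))
print("Full peak MiB", peak_memory_mb(t2, x2, attn_mask))

Whatever headroom shows up there can be spent on a bigger batch for the linear model.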
Cheers, Angelos
Hi,
I assume that the problem was something else and this is now solved. If it is not the case, feel free to reopen the issue.
Best, Angelos
Currently, I am using the performer repo that uses this repo's CausalDotProduct. However, while benchmarking the model, we found that the performer attention was taking longer CUDA time than the original attention. We benchmarked the forward call and causal_dot_product_kernel was the main bottleneck. I am aware that the recent merge #77 made causal_dot_product_kernel faster. I ran the model on Ubuntu 18.04, with CUDA 11.2 and PyTorch 1.8.1+cu111 installed. Should the CUDA version affect the performance of causal_dot_product_kernel? Thanks for the great repo!
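In case it helps, here is a rough sketch of how the kernel could be timed in isolation. It assumes the causal_dot_product wrapper exported by fast_transformers.causal_product (i.e. CausalDotProduct.apply) and shapes matching the discussion above (batch 24, 16 heads, sequence length 1024, 64 dimensions per head); it is only a sketch, not a definitive benchmark:

import torch
from fast_transformers.causal_product import causal_dot_product

# In the real model Q and K would already be feature-mapped
# (e.g. elu+1 or Favor features); random tensors are enough for timing.
N, H, L, E = 24, 16, 1024, 64
Q = torch.randn(N, H, L, E).cuda()
K = torch.randn(N, H, L, E).cuda()
V = torch.randn(N, H, L, E).cuda()

def cuda_time(fn):
    # warmup
    fn()
    fn()
    # time a single call with CUDA events (milliseconds)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)

with torch.no_grad():
    print("causal_dot_product", cuda_time(lambda: causal_dot_product(Q, K, V)))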