Hi,
d_model refers to the total embedding dimension of the encoder layer, including all the attention heads, that is, d_model = query_dim * n_heads. It should therefore be divisible by the number of attention heads.
Could you try the following code and see if it works for you?
import torch

from fast_transformers.attention import AttentionLayer, ImprovedClusteredAttention
from fast_transformers.transformers import TransformerEncoderLayer, TransformerEncoder


def test_improved_clustered_attention_forward():
    query_dim = 64
    n_heads = 8
    d_model = query_dim * n_heads
    transformer = TransformerEncoder([
        TransformerEncoderLayer(
            AttentionLayer(
                ImprovedClusteredAttention(
                    clusters=100,
                    topk=32,
                    bits=63
                ),
                d_model,
                n_heads
            ),
            d_model,
            n_heads
        )
        for i in range(6)
    ])
    transformer = transformer.to('cuda')
    x = torch.rand(1, 1700 // 4, d_model).cuda()
    y = transformer(x)
Thanks, Apoorv
When d_model is 2048, the error still occurs. I found that whenever d_model is larger than 1540, the script hits the runtime error. You can change query_dim in your script to 2048 to reproduce the case.
Hi,
You are right, I get an error when I set query_dim=2048. However, note that d_model is not equivalent to query_dim.
query_dim refers to the query embedding size used for a single attention head, while d_model refers to the embedding size when all attention heads are combined, i.e. d_model = query_dim * n_heads.
With that said, there are two sources of the bug, both arising from the shared-memory optimization in our CUDA implementations. They place the following constraints on the per-head query dimension:
Hashing kernel (used by both clustered and improved-clustered attention): query_dim <= (1024 * 12) / (bits + 1)
Sparse dot product (used only by improved-clustered attention): query_dim <= (1024 * 12) / (2 * topk)
Note that the constraint only needs to be met per attention head; you can still set n_heads as high as you like to get a much larger d_model.
For instance, if you set query_dim = 192 and n_heads = 30, it gives a d_model of 5760 and runs without error. Is it only for testing or do you need the query_dim of each head to be that high?
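For reference, here is a quick back-of-the-envelope check of those limits (a sketch, assuming the bits=63 and topk=32 values used in the snippet above):

# Shared-memory limits on the per-head query dimension,
# assuming bits=63 and topk=32 as in the snippet above.
bits, topk = 63, 32
max_qdim_hashing = (1024 * 12) // (bits + 1)    # = 192
max_qdim_sparse_dp = (1024 * 12) // (2 * topk)  # = 192
print(max_qdim_hashing, max_qdim_sparse_dp)
# Both come out to 192, which is why query_dim = 192 works while 2048 does not.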
I will keep the issue open to provide fixes for some of these, or at least graceful exits.
Thanks, Apoorv
Thank you for your reply. Indeed, the reason I set d_model=2048 is that the features extracted from ResNet50-C5 have 2048 channels, and I split the input channels into 8 groups. Could you tell me how to change the CUDA code for the sparse dot product to fix this error?
I wouldn't advise changing the sparse dot product as the first step.
I would rather introduce a linear layer that projects from 2048 to d_model, where d_model = query_dim * n_heads. query_dim and n_heads can be your Transformer-specific hyperparameters. I would suggest keeping query_dim = 32 or 64.
I have provided an example of using the Transformer on top of arbitrary features:
import torch
from torch.nn import Linear, Module

from fast_transformers.attention import AttentionLayer, \
    ClusteredAttention, ImprovedClusteredAttention
from fast_transformers.transformers import TransformerEncoderLayer, TransformerEncoder


class FeatureTransformer(Module):
    def __init__(self, feature_dim, d_query, n_heads, n_layers):
        super(FeatureTransformer, self).__init__()

        # This will project the features to d_model dimensions to be compatible with the Transformer
        d_model = d_query * n_heads
        self.feat_proj = Linear(feature_dim, d_model)

        # Actual Transformer
        self.transformer = TransformerEncoder([
            TransformerEncoderLayer(
                AttentionLayer(
                    ImprovedClusteredAttention(
                        clusters=100,
                        bits=63,
                        topk=32
                    ),
                    d_model,
                    n_heads
                ),
                d_model,
                n_heads
            )
            for i in range(n_layers)
        ])

    def forward(self, x):
        return self.transformer(self.feat_proj(x))


def test_resnet_ic_attention():
    feature_dim = 2048  # This refers to the dimension of the extracted features
    query_dim = 64      # Setting it to 64 as it is most commonly used
    n_heads = 8         # You can also change this if you want more heads
    n_layers = 6        # Number of encoder layers
    transformer = FeatureTransformer(
        feature_dim, query_dim, n_heads, n_layers
    ).to('cuda')

    N = 1
    T = 200
    # Generating dummy input to mimic incoming features
    x = torch.rand(N, T, feature_dim).cuda()
    y = transformer(x)
    print(y.shape)
Let me know if this works or if you have other questions.
--Apoorv
Many thanks for your reply. I have decreased the channel dimension, but I find my code runs for about 500 iterations and then hits the error 'RuntimeError: CUDA error: an illegal memory access was encountered'. Do you have any idea what could cause it?
Hi,
I will try to help. Given that your code runs for a while before failing, there are only a few things that could be going wrong. My main suspect is that the sequence length in some batches is too short for clustered attention. If that is the cause, I would advise using attention_type = 'conditional-full:improved-clustered', which falls back to full attention when the sequence length is below a preset limit. I have shown an example below:
from fast_transformers.builders import TransformerEncoderBuilder

query_dim = 64
n_layers = 6
n_heads = 8
ff_dim = query_dim * n_heads * 4

builder = TransformerEncoderBuilder.from_kwargs(
    n_layers=n_layers,
    n_heads=n_heads,
    query_dimensions=query_dim,
    value_dimensions=query_dim,
    feed_forward_dimensions=ff_dim
)

# Improved-clustered attention with a fallback to full attention
builder.attention_type = "conditional-full:improved-clustered"
builder.attention.clusters = 100
# Length limit below which we use full attention
builder.attention.length_limit = 512

transformer = builder.get().cuda()
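Continuing from the snippet above, a quick sanity check could look like the following (a sketch with dummy inputs; the sequence lengths are only illustrative):

import torch

d_model = query_dim * n_heads  # 512 with the settings above
# Below length_limit -> falls back to full attention
x_short = torch.rand(1, 256, d_model).cuda()
# Above length_limit -> uses improved-clustered attention
x_long = torch.rand(1, 2000, d_model).cuda()
print(transformer(x_short).shape, transformer(x_long).shape)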
For other practical tips, or an example of improved-clustered attention on a toy task, you can now look at the colab notebook we provide.
I could help more if you could share the transformer architecture with some dummy input passed through it, either in a colab notebook or here.
Thanks, Apoorv
I am assuming that this was resolved, so I will close this issue. Feel free to reopen it.
Thanks, Apoorv
I have changed some hyperparameters of test_improved_clustered_transformer_gpu.py, as shown in the following figure. When the input length is 475 and d_model is larger than 1540, the script hits "RuntimeError: CUDA error: invalid argument." Could you tell me why this happens?