idiap / fast-transformers

Pytorch library for fast transformer implementations

RuntimeError: CUDA error: invalid argument when running tests/attention/test_improved_clustered_transformer_gpu.py #42

Closed justimyhxu closed 4 years ago

justimyhxu commented 4 years ago

I have changed some hyperparameters of test_improved_clustered_transformer_gpu.py (screenshot attached). When the input length is 475 and d_model is larger than 1540, the script fails with "RuntimeError: CUDA error: invalid argument." Could you tell me why this happens?

apoorv2904 commented 4 years ago

Hi,

d_model refers to the total embedding dimension of the encoder layer, including all attention heads; that is, d_model = query_dim * n_heads.

It should be divisible by the number of attention heads.

Could you try the following code and see if it works for you?

import torch

from fast_transformers.attention import AttentionLayer, ImprovedClusteredAttention
from fast_transformers.transformers import TransformerEncoder, TransformerEncoderLayer

def test_improved_clustered_attention_forward():
    query_dim = 64                   # embedding size of a single attention head
    n_heads = 8
    d_model = query_dim * n_heads    # total embedding size of the encoder layer
    transformer = TransformerEncoder([
        TransformerEncoderLayer(
            AttentionLayer(
                ImprovedClusteredAttention(
                    clusters=100,
                    topk=32,
                    bits=63
                ),
                d_model,
                n_heads
            ),
            d_model,
            n_heads
        )
        for i in range(6)
    ])
    transformer = transformer.to('cuda')
    x = torch.rand(1, 1700 // 4, d_model).cuda()
    y = transformer(x)

Thanks, Apoorv

justimyhxu commented 4 years ago

I also get the error when d_model is 2048. I found that whenever d_model is larger than 1540 the script hits the runtime error. You can reproduce the case by setting query_dim to 2048 in your script.

apoorv2904 commented 4 years ago

Hi,

You are right, I get an error when I set query_dim=2048. However, note that d_model is not equivalent to query_dim.

query_dim refers to the query embedding size of a single attention head; d_model refers to the embedding size when all attention heads are combined, i.e. d_model = query_dim * n_heads.

That said, there are two sources of this bug, both arising from the shared-memory optimization in our CUDA implementations. The constraints they place are:

  1. Hashing Kernel: This places a constraint that query_dim <= (1024 * 12) / (bits + 1)
  2. Sparse Dot Product Kernel: This places the constraint that query_dim <= (1024*12) / (2*topk)

Hashing Kernel is used by both clustered and improved-clustered attention. Sparse Dot Product is only used by improved-clustered.

Note that the constraint only needs to be met per attention head; you can still set n_heads to anything to get a much higher d_model.

For instance, if you set query_dim = 192 and n_heads = 30, you get a d_model of 5760 and it runs without error (see the quick check below). Is this only for testing, or do you need the query_dim of each head to be that high?
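To make the limits concrete, here is a tiny illustrative helper (not part of the library; the function name is made up) that computes the largest per-head query_dim allowed by the two kernels:

def max_query_dim(bits=63, topk=32, shared_floats=1024 * 12):
    # Hashing kernel: query_dim <= shared_floats / (bits + 1)
    hashing_limit = shared_floats // (bits + 1)
    # Sparse dot product kernel: query_dim <= shared_floats / (2 * topk)
    sparse_dot_limit = shared_floats // (2 * topk)
    return min(hashing_limit, sparse_dot_limit)

print(max_query_dim())  # 192 for bits=63, topk=32, matching the example above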

I will keep the issue open so we can add fixes for some of these cases or at least fail gracefully.

Thanks, Apoorv

justimyhxu commented 4 years ago

Thank you for your reply. The reason I set d_model=2048 is that the features extracted from ResNet50-C5 have 2048 channels, and I split the input channels into 8 groups. Could you tell me how to change the CUDA code for the sparse dot product to fix this error?

apoorv2904 commented 4 years ago

I wouldn't advise changing the sparse dot product as the first step.

I would rather introduce a linear layer that projects from 2048 to d_model, where d_model = query_dim * n_heads. query_dim and n_heads can be your Transformer-specific hyperparameters. I would suggest keeping query_dim = 32 or 64.

I have provided an example of using the Transformer on top of arbitrary features:

import torch
from torch.nn import Linear, Module

from fast_transformers.attention import AttentionLayer, \
    ClusteredAttention, ImprovedClusteredAttention
from fast_transformers.transformers import TransformerEncoderLayer, TransformerEncoder

class FeatureTransformer(Module):
    def __init__(self, feature_dim,
                 d_query, n_heads, n_layers):
        super(FeatureTransformer, self).__init__()
        # This will project the features to d_model dimensions to be compatible with Transformer
        d_model = d_query * n_heads
        self.feat_proj = Linear(feature_dim, d_model)

        # Actual Transformer
        self.transformer = TransformerEncoder([
            TransformerEncoderLayer(
                AttentionLayer(
                    ImprovedClusteredAttention(
                        clusters=100,
                        bits=63,
                        topk=32
                    ),
                    d_model,
                    n_heads
                ),
                d_model,
                n_heads
            )
            for i in range(n_layers)
        ])

    def forward(self, x): 
        return self.transformer(self.feat_proj(x))

def test_resnet_ic_attention():
    feature_dim = 2048  # This refers to the dimension of extracted features

    query_dim = 64      # Setting it to 64 as it is most commonly used
    n_heads = 8         # You can also change this if you want more heads
    n_layers = 6        # Number of encoder layers

    transformer = FeatureTransformer(
        feature_dim, query_dim, n_heads, n_layers
    ).to('cuda')

    N = 1
    T = 200
    # Generating dummy input to mimic incoming features
    x = torch.rand(N, T,  feature_dim).cuda()
    y = transformer(x)
    print(y.shape)

Let me know if this works or you have other questions.

--Apoorv

justimyhxu commented 4 years ago

Many thanks for your reply. I have decreased the number of channels, but my code runs for about 500 iterations and then fails with 'RuntimeError: CUDA error: an illegal memory access was encountered'. Do you have any idea what might cause it?

apoorv2904 commented 4 years ago

Hi,

I will try to help. Given that your code runs at all, only a few things could be going wrong.

  1. What number of clusters did you set? Could it happen that for some batch the maximum sequence length (time steps) is smaller than the number of clusters?

I suspect this is the cause of the issue. If it is, I would advise using attention_type = 'conditional-full:improved-clustered'. It falls back to full attention when the sequence length is below a preset limit. I have shown an example below:

from fast_transformers.builders import TransformerEncoderBuilder

query_dim = 64
n_layers = 6
n_heads = 8
ff_dim = query_dim * n_heads * 4

builder = TransformerEncoderBuilder.from_kwargs(
    n_layers=n_layers,
    n_heads=n_heads,
    query_dimensions=query_dim,
    value_dimensions=query_dim,
    feed_forward_dimensions=ff_dim
)

# Improved-Clustered
builder.attention_type = "conditional-full:improved-clustered"
builder.attention.clusters = 100
# Length limit below which we use full attention
builder.attention.length_limit = 512
transformer = builder.get().cuda()
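
As a quick sanity check, here is a minimal forward pass with dummy input (a sketch; it assumes the builder and hyperparameters above, so d_model = query_dim * n_heads = 512):

import torch

d_model = query_dim * n_heads
# A sequence shorter than length_limit (512) falls back to full attention;
# a longer one uses improved-clustered attention.
x_short = torch.rand(1, 300, d_model).cuda()
x_long = torch.rand(1, 1000, d_model).cuda()
print(transformer(x_short).shape, transformer(x_long).shape)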

For other practical tips, or an example of improved-clustered attention on a toy task, you can now look at the Colab notebook we provide.

I could help more if you could share the transformer architecture with some dummy input passed through it, either in a Colab notebook or here.

Thanks, Apoorv

apoorv2904 commented 4 years ago

I am assuming that this was resolved, so I will close the issue. Feel free to reopen it.

Thanks, Apoorv