idiap / fast-transformers

Pytorch library for fast transformer implementations

Feature Maps without using builders #47

Closed adamsolomou closed 4 years ago

adamsolomou commented 4 years ago

Hi,

I have been trying to use feature maps without using builders to construct the model, but I run into an error during training.

I have tried the following:


attention_layer = AttentionLayer(
            LinearAttention(d_query, feature_map=Favor.factory(n_dims=120)), 
            d_model, 
            n_heads, 
            d_query, 
            d_values)

transformer = TransformerEncoder(
            [
                TransformerEncoderLayer(
                    attention_layer,
                    d_model,
                    d_ff,
                    dropout,
                    activation
                ) for l in range(n_layers)
            ],
            norm_layer = LayerNorm(d_model)
            )

I have also tried to define only the self-attention module using builders as follows:


attention_module = AttentionBuilder.from_kwargs(
                query_dimensions=d_query, 
                feature_map=Favor.factory(n_dims=120)).get('linear')

attention_layer = AttentionLayer(attention_module, 
                                 d_model, 
                                 n_heads, 
                                 d_query, 
                                 d_values)

But both ways give rise to the following error (in my case d_query=60):

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [60, 60]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Error detected in MmBackward. The traceback of the forward call that caused the error terminates at:

File "/fast_transformers/feature_maps/fourier_features.py", line 185, in forward u = x.unsqueeze(-2).matmul(self.omega).squeeze(-2)

If I switch to defining the model entirely with builders, the problem does not appear. But I was wondering whether the random Fourier features can be used outside the builders, as I personally prefer the vanilla interface.

Many thanks in advance!

angeloskath commented 4 years ago

Hi,

Sorry for the late reply.

Of course they can be used without builders (or at least they should :-D ). I also believe that the non-builder interface might be more suitable for power users, and the builders for quick experimentation.

Having said that, could you please provide some example code that causes the error? The following code works fine for me.

from fast_transformers.attention import AttentionLayer
from fast_transformers.builders import AttentionBuilder
from fast_transformers.feature_maps import Favor
from fast_transformers.masking import FullMask
import torch

lin_att = AttentionBuilder.from_kwargs(
    query_dimensions=60,
    feature_map=Favor.factory(n_dims=120)
).get("linear")
att = AttentionLayer(lin_att, 300, 300 // 60, 60, 60)
m = FullMask(20, 20)
l = FullMask(1, 20)
x = torch.rand(1, 20, 300)
att(x, x, x, m, l, l)

Also, which version of PyTorch are you using? We have not included 1.7 in the continuous integration yet.

Cheers, Angelos

adamsolomou commented 4 years ago

Hi,

So the issue arises during training (at the first training step). I get the following error.

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [60, 60]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

I believe the variable it refers to is omega within the Favor feature map (since for d_query=60 and n_dims=120, omega would have shape (60, 60) and is modified by an in-place operation).

I declare the model as follows:

attention_module = AttentionBuilder.from_kwargs(
     query_dimensions=d_query, 
     feature_map=Favor.factory(n_dims=120)).get('linear')

attention_layer = AttentionLayer(attention_module, 
                                 d_model, 
                                 n_heads, 
                                 d_query, 
                                 d_values)

# Encoder model 
transformer = TransformerEncoder(
            [
                TransformerEncoderLayer(
                    attention_layer,
                    d_model,
                    d_ff,
                    dropout,
                    activation
                ) for l in range(n_layers)
            ],
            norm_layer = LayerNorm(d_model)
            )

My overall model is a bit different, as it has some additional embedding layers before the transformer and a classification head on top, so I am not quite sure how easy it is to share the whole code. But the transformer backbone is created as above.

My training pipeline looks something like this:

# Repeat for each epoch 
for epoch in range(EPOCHS): 
    # Turn on the training mode
    model.train()

    # Loop over batches 
    for train_batch_idx, train_batch in enumerate(tqdm(train_loader)):
        # Unpack batch 
        y = train_batch['label'].to(device)
        x_tok = train_batch['tokens'].to(device)
        length_mask = train_batch['length'].to(device)

        optimizer.zero_grad()

        # Call model 
        logits = model(x_tok, length_mask)

        # Loss 
        loss = criterion(logits, y)
        loss.backward()

        # Optimization step 
        optimizer.step()
        global_step += 1

        # Update learning rate 
        optimizer = lr_scheduler(optimizer, 
                                 global_step, 
                                 LEARNING_RATE, 
                                 TRAIN_STEPS, 
                                 WARMUP_STEPS)

The overall model (model) is the transformer encoder specified above, together with the additional layers mentioned earlier. If I change the way I declare the transformer model and use builders instead of the declaration above, the overall code works fine.

Please let me know if something is not clear in my description. I am using torch 1.6.0+cu101.

Many thanks, Adamos

angeloskath commented 4 years ago

Hi,

Everything is clear from your description. So there is a bug, but it is not what I had thought it would be and to be honest I am not sure what the best solution is.

The way you are creating your transformer encoder, the attention layer is shared among all transformer layers, similar to ALBERT. However, the way the feature map is implemented, the random matrix parameter is overwritten at every application of the feature map. This means that the feature map (as currently implemented) cannot be shared and used by multiple layers in the same forward/backward pass.
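To make the failure mode concrete, here is a minimal, self-contained sketch (the toy module below is made up for illustration; it is not the library's Favor implementation) of what goes wrong when a module that overwrites a saved tensor in place is applied more than once in the same forward pass:

import torch

# Toy module that, like the shared feature map, redraws its random matrix
# in place on every forward call.
class SharedRandomProjection(torch.nn.Module):
    def __init__(self, dims):
        super().__init__()
        self.register_buffer("omega", torch.randn(dims, dims))

    def forward(self, x):
        # Overwriting omega in place bumps its version counter and invalidates
        # the tensor autograd saved for the previous matmul.
        self.omega.copy_(torch.randn_like(self.omega))
        return x.matmul(self.omega)

proj = SharedRandomProjection(60)
x = torch.rand(8, 60, requires_grad=True)

# The same instance is applied twice, just like one attention layer reused
# by every encoder layer.
y = proj(proj(x)).sum()
y.backward()  # RuntimeError: ... modified by an inplace operation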

Quick & dirty solution

The first solution is a kind of hack, but you can implement it quickly and work with it until I implement a more robust solution in the feature map itself. Instead of a single attention layer, you can create many attention layers that share the same parameters; this way each attention layer has its own feature map. The following code implements the aforementioned "hack":

from fast_transformers.utils import make_mirror

# This builds n_layers independent attention modules
att_builder = AttentionBuilder.from_kwargs(
    query_dimensions=d_query,
    feature_map=Favor.factory(n_dims=120)
)
attention_layers = [
    AttentionLayer(att_builder.get("linear"), d_model, n_heads, d_query, d_values)
    for l in range(n_layers)
]

# This makes all of them mirrors of the first one
for i in range(1, n_layers):
    make_mirror(attention_layers[0], attention_layers[i])

# Now you can build your transformer as usual
transformer = TransformerEncoder(
    [
        TransformerEncoderLayer(
            attention_layer,
            d_model,
            d_ff,
            dropout,
            activation
        ) for attention_layer in attention_layers
    ],
    norm_layer=LayerNorm(d_model)
)
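In other words, the learnable weights are shared across layers (ALBERT-style), but each attention layer now has its own feature map and hence its own random matrix, so nothing is overwritten across layers within a single forward pass.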

More permanent solution

Obviously the feature map should work correctly out of the box, similar to an autograd function, without the caller having to worry whether it is applied many times during a forward pass. The most likely permanent solution is to never overwrite the random matrix tensor and simply create a new one when needed. I will have to check whether this has any performance implications, as well as how to access the device to create the tensor on (maybe through new_feature_map()).
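For reference, a rough sketch of that direction (simplified names, and a hypothetical device argument on new_feature_map(); not necessarily the code that will end up in the library) could look something like this:

import torch

class RandomFourierSketch(torch.nn.Module):
    # Illustrative only: a simplified random Fourier feature map that
    # allocates a fresh random matrix instead of overwriting the old one
    # in place, so tensors saved by autograd in earlier layers stay valid.
    def __init__(self, query_dims, n_dims):
        super().__init__()
        self.query_dims = query_dims
        self.n_dims = n_dims
        self.omega = None

    def new_feature_map(self, device=None):
        # Allocate a new tensor every time instead of copy_()-ing into a
        # registered buffer (device passed in is an assumed API detail).
        self.omega = torch.randn(
            self.query_dims, self.n_dims // 2, device=device
        )

    def forward(self, x):
        u = x.matmul(self.omega)
        return torch.cat([torch.cos(u), torch.sin(u)], dim=-1)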

I will probably implement the fix and push it before Monday.

Thanks for being patient (it has been almost 10 days).

Cheers, Angelos

angeloskath commented 4 years ago

I actually pushed a fix. Let me know if you are still experiencing problems and feel free to reopen this issue in that case.

Cheers, Angelos

adamsolomou commented 4 years ago

Hi Angelos,

I see the problem, and thanks for taking care of it 🙂 I haven't used the updated version of the code yet, but I will let you know if I experience any problems related to this issue.

Nonetheless, I would like to ask some clarifying questions based on your previous comment:

  1. Does cross-layer parameter sharing (of attention) also occur if the model is declared using builders? That is, if I use the following declaration, would the attention parameters be shared across all layers?
transformer_builder = TransformerEncoderBuilder.from_kwargs(
    n_layers=n_layers, 
    n_heads=n_heads, 
    query_dimensions=d_query, 
    value_dimensions=d_values, 
    feed_forward_dimensions=d_ff,
    activation=activation, 
    dropout=dropout, 
    final_normalization=norm_layer)

if attention_type == 'softmax': 
    transformer_builder.attention_type = 'full'
elif attention_type == 'softmax-kernel': 
    transformer_builder.attention_type = 'linear'
    transformer_builder.feature_map=partial(Favor, 
                                            n_dims=2*d_query, 
                                            stabilize=True)

If yes, is there a way not to enforce cross-layer parameter sharing?

  2. The proposed "Quick & Dirty Solution" also implements cross-layer parameter sharing, right? If so, would removing the following lines avoid weight-sharing?
# This makes all of them mirrors of the first one
for i in range(1, n_layers):
    make_mirror(attention_layers[0], attention_layers[i])
  3. Lastly, does the following model declaration also impose weight sharing?
from fast_transformers.attention import AttentionLayer, FullAttention

bert = TransformerEncoder(
    [
        TransformerEncoderLayer(
            AttentionLayer(FullAttention(), 768, 12),
            768,
            12,
            activation="gelu"
        ) for l in range(12)
    ],
    norm_layer=torch.nn.LayerNorm(768)
)

Many thanks in advance!

angeloskath commented 3 years ago

Hi,

  1. No. When using the builders, each transformer layer has a different attention layer. So no parameter sharing. The creation code follows along the lines you wrote in bullet point 3.
  2. Yes, exactly. These lines are implementing the parameter sharing.
  3. Nope. Since you are creating a new attention layer for every transformer layer, no weights are shared (see the quick check below).
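If you want to double-check in your own code, here is a quick sanity check (a sketch, assuming the encoder is built as in the snippets above, exposes its layers as transformer.layers, and each layer exposes its attention module as .attention): weights are shared exactly when the parameter tensors are the same objects.

def shares_parameters(module_a, module_b):
    # Two modules share weights iff (some of) their parameters are literally
    # the same tensor objects.
    ids_a = {id(p) for p in module_a.parameters()}
    return any(id(p) in ids_a for p in module_b.parameters())

# True for the make_mirror "hack", False for the builder-created encoder
# and for the declaration in bullet point 3.
print(shares_parameters(transformer.layers[0].attention,
                        transformer.layers[1].attention))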

Cheers, Angelos