Closed: adamsolomou closed this issue 4 years ago.
Hi,
Sorry for the late reply.
Of course they can be used without builders (or at least they should be :-D ). I also believe that the non-builder interface might be more suitable for power users, and the builders for quick experimentation.
Having said that, could you please provide some example code that causes the error? The following code works fine for me:
from fast_transformers.attention import AttentionLayer
from fast_transformers.builders import AttentionBuilder
from fast_transformers.feature_maps import Favor
from fast_transformers.masking import FullMask
import torch
lin_att = AttentionBuilder.from_kwargs(
    query_dimensions=60,
    feature_map=Favor.factory(n_dims=120)
).get("linear")
att = AttentionLayer(lin_att, 300, 300 // 60, 60, 60)
m = FullMask(20, 20)
l = FullMask(1, 20)
x = torch.rand(1, 20, 300)
att(x, x, x, m, l, l)
Also, which version of PyTorch are you using? We have not included 1.7 in the continuous integration yet.
Cheers, Angelos
Hi,
So the issue arises during training (at the first training step). I get the following error.
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [60, 60]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
I believe the variable it refers to is omega within the Favor feature map (since for d_query=60 and n_dims=120, omega would have shape (60, 60) and is modified by an in-place operation).
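For reference, the same class of error can be reproduced with a few lines of plain PyTorch, independent of the library; the key ingredient is an in-place write to a tensor that autograd saved during the forward pass (toy example, names are made up):
import torch

w = torch.randn(3, 3, requires_grad=True)
x = torch.randn(1, 3)

y = x.matmul(w).sum()            # MmBackward saves w to compute the gradient w.r.t. x
with torch.no_grad():
    w.copy_(torch.randn(3, 3))   # in-place overwrite bumps w's version counter

y.backward()                     # RuntimeError: ... modified by an inplace operation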
I declare the model as follows:
attention_module = AttentionBuilder.from_kwargs(
    query_dimensions=d_query,
    feature_map=Favor.factory(n_dims=120)
).get('linear')

attention_layer = AttentionLayer(attention_module,
                                 d_model,
                                 n_heads,
                                 d_query,
                                 d_values)
# Encoder model
transformer = TransformerEncoder(
    [
        TransformerEncoderLayer(
            attention_layer,
            d_model,
            d_ff,
            dropout,
            activation
        ) for l in range(n_layers)
    ],
    norm_layer=LayerNorm(d_model)
)
My overall model is a bit different, as it has some additional embedding layers before the transformer and a classification head on top, so I am not quite sure how easy it is to share the whole code. But the transformer backbone is created as above.
My training pipeline looks something like this:
# Repeat for each epoch
for epoch in range(EPOCHS):
    # Turn on the training mode
    model.train()

    # Loop over batches
    for train_batch_idx, train_batch in enumerate(tqdm(train_loader)):
        # Unpack batch
        y = train_batch['label'].to(device)
        x_tok = train_batch['tokens'].to(device)
        length_mask = train_batch['length'].to(device)

        optimizer.zero_grad()

        # Call model
        logits = model(x_tok, length_mask)

        # Loss
        loss = criterion(logits, y)
        loss.backward()

        # Optimization step
        optimizer.step()
        global_step += 1

        # Update learning rate
        optimizer = lr_scheduler(optimizer,
                                 global_step,
                                 LEARNING_RATE,
                                 TRAIN_STEPS,
                                 WARMUP_STEPS)
The overall model (model) is the transformer encoder specified above, together with the additional layers mentioned earlier. If I change the way I declare the transformer and use builders instead of the declaration above, the overall code works fine.
Please let me know if something is not clear in my description. I am using torch 1.6.0+cu101.
Many thanks, Adamos
Hi,
Everything is clear from your description. So there is a bug, but it is not what I had thought it would be and, to be honest, I am not sure what the best solution is.
The way you are creating your transformer encoder, the attention layer is shared among all transformer layers, similar to ALBERT. However, the way it is implemented, the random matrix parameter is overwritten at every application of the feature map. This means that the feature map (as currently implemented) cannot be shared and used by multiple layers in the same forward/backward pass.
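To illustrate the distinction with plain torch.nn modules (a minimal sketch, not the library's API): reusing one module instance across layers shares its parameters, whereas constructing a new instance per layer does not:
import torch

shared = torch.nn.Linear(8, 8)
albert_like = torch.nn.ModuleList([shared for _ in range(3)])                   # one instance reused
independent = torch.nn.ModuleList([torch.nn.Linear(8, 8) for _ in range(3)])    # new instance per layer

print(sum(p.numel() for p in albert_like.parameters()))    # 72  (parameters counted once)
print(sum(p.numel() for p in independent.parameters()))    # 216 (three separate layers)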
The first solution is a bit of a hack, but you can implement it quickly and work with it until I implement a more robust solution in the feature map itself. You can create many attention layers that share the same parameters instead of one attention layer. This way each attention layer has its own feature map. The following code implements the aforementioned "hack":
from fast_transformers.utils import make_mirror
# This builds n_layers independent attention modules
att_builder = AttentionBuilder.from_kwargs(
    query_dimensions=d_query,
    feature_map=Favor.factory(n_dims=120)
)
attention_layers = [
    AttentionLayer(att_builder.get("linear"), d_model, n_heads, d_query, d_values)
    for l in range(n_layers)
]

# This makes all of them mirrors of the first one
for i in range(1, n_layers):
    make_mirror(attention_layers[0], attention_layers[i])
# Now you can build your transformer as usual
transformer = TransformerEncoder(
    [
        TransformerEncoderLayer(
            attention_layer,
            d_model,
            d_ff,
            dropout,
            activation
        ) for attention_layer in attention_layers
    ],
    norm_layer=LayerNorm(d_model)
)
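As a quick sanity check (a hypothetical snippet; it assumes make_mirror works by making the mirrored layers reuse the exact same parameter tensors, which is what its name suggests), you can verify that the layers really share their weights:
# Every parameter of layer i should be backed by the same storage as layer 0
p0 = dict(attention_layers[0].named_parameters())
for layer in attention_layers[1:]:
    for name, param in layer.named_parameters():
        assert param.data_ptr() == p0[name].data_ptr(), name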
Obviously, the feature map should work correctly out of the box, similar to an autograd function, without the implementation caring how many times it is applied during a forward pass. The most likely permanent solution is to never overwrite the random matrix tensor and simply create a new one when needed. I will have to check whether this has any performance implications, as well as how to access the device to create the tensor on (maybe through new_feature_map()).
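To make the idea concrete, here is a rough sketch of what such a feature map could look like (hypothetical code, not the actual Favor implementation; the class name and attributes are made up):
import torch

class RandomFeatureMap(torch.nn.Module):
    """Toy random-feature map that never mutates an existing tensor in place."""
    def __init__(self, query_dims, n_dims):
        super().__init__()
        self.query_dims = query_dims
        self.n_dims = n_dims
        self.omega = None

    def new_feature_map(self, device="cpu"):
        # Allocate a fresh random matrix instead of copy_()-ing into the old one,
        # so tensors saved by autograd in earlier layers are never modified.
        self.omega = torch.randn(self.query_dims, self.n_dims, device=device)

    def forward(self, x):
        # x: (..., query_dims) -> (..., n_dims)
        return x.matmul(self.omega)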
I will probably implement the fix and push it before Monday.
Thanks for being patient (it has been almost 10 days).
Cheers, Angelos
I actually pushed a fix. Let me know if you are still experiencing problems and feel free to reopen this issue in that case.
Cheers, Angelos
Hi Angelos,
I see the problem, and thanks for taking care of it 🙂 I haven't used the updated version of the code yet, but I will let you know if I experience any problems related to this issue.
Nonetheless, I would like to ask some clarifying questions based on your previous comment:
transformer_builder = TransformerEncoderBuilder.from_kwargs(
    n_layers=n_layers,
    n_heads=n_heads,
    query_dimensions=d_query,
    value_dimensions=d_values,
    feed_forward_dimensions=d_ff,
    activation=activation,
    dropout=dropout,
    final_normalization=norm_layer
)

if attention_type == 'softmax':
    transformer_builder.attention_type = 'full'
elif attention_type == 'softmax-kernel':
    transformer_builder.attention_type = 'linear'
    transformer_builder.feature_map = partial(Favor,
                                              n_dims=2*d_query,
                                              stabilize=True)
If yes, is there a way not to enforce cross-layer parameter sharing?
# This makes all of them mirrors of the first one
for i in range(1, n_layers):
    make_mirror(attention_layers[0], attention_layers[i])
from fast_transformers.attention import AttentionLayer, FullAttention

bert = TransformerEncoder(
    [
        TransformerEncoderLayer(
            AttentionLayer(FullAttention(), 768, 12),
            768,
            12,
            activation="gelu"
        ) for l in range(12)
    ],
    norm_layer=torch.nn.LayerNorm(768)
)
Many thanks in advance!
Hi,
Cheers, Angelos
Hi,
I have been trying to use feature maps without using builders to construct the model but I experience an error during training.
I have tried the following:
I have also tried to define only the self-attention module using builders as follows:
But both ways give rise to the following error (in my case, d_query=60):
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [60, 60]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
Error detected in MmBackward. The traceback of the forward call that caused the error terminates at:
File "/fast_transformers/feature_maps/fourier_features.py", line 185, in forward
    u = x.unsqueeze(-2).matmul(self.omega).squeeze(-2)
If I switch to using builders entirely to define the model, the problem does not appear. But I was wondering whether the random Fourier features can be used outside builders (I personally prefer the vanilla interface).
Many thanks in advance!