Hi,
I am not sure I follow your modifications regarding masking. For instance, we cannot apply an N x N mask with linear attention because the attention matrix is never computed explicitly. What is the head mask in this case? Have you checked that, given the same weights and the same inputs, both the HuggingFace linear attention and ours return the same result?
In general there are no special tricks used for training and I have not seen any kind of instability. Could you provide more information about that? Sequence length, query size and possibly the range of the values of the normalizer Z? As in any transformer you could also use learning rate warmup and gradient clipping, but I wouldn't say that I have experienced divergence otherwise.
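For reference, here is a minimal sketch of (non-causal) linear attention in plain PyTorch, assuming the default elu(x) + 1 feature map; it is not the library's exact code, but it shows why the N x N attention matrix is never materialized and what the normalizer Z refers to.

```python
import torch
import torch.nn.functional as F

def linear_attention(Q, K, V, eps=1e-6):
    # Q, K: (batch, seq_len, heads, dim); V: (batch, seq_len, heads, dim_v)
    Q = F.elu(Q) + 1                                           # phi(Q), positive features
    K = F.elu(K) + 1                                           # phi(K)
    KV = torch.einsum("nshd,nshm->nhdm", K, V)                 # sum_j phi(K_j) V_j^T
    Z = torch.einsum("nlhd,nhd->nlh", Q, K.sum(dim=1)) + eps   # per-query normalizer
    return torch.einsum("nlhd,nhdm->nlhm", Q, KV) / Z.unsqueeze(-1)
```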
Cheers, Angelos
I used the masking that I thought you apply when using linear attention:
```python
# Inside the forward of the modified MobileBertModel (see the link below);
# imports added for clarity.
import torch
from fast_transformers.masking import FullMask, LengthMask

N = input_shape[0]  # batch size
L = input_shape[1]  # sequence length
if input_ids is not None:
    x = input_ids
elif inputs_embeds is not None:
    x = inputs_embeds
else:
    raise ValueError("You have to specify either input_ids or inputs_embeds")
# All-ones L x L mask, since linear attention cannot use an arbitrary N x N mask
extended_attention_mask = FullMask(L, device=x.device)
# extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0  # original HuggingFace masking
head_mask = LengthMask(x.new_full((N,), L, dtype=torch.int64))
```
I assumed that this should be the head masking, i.e. your LengthMask class. I took it from somewhere in your code, thinking it should be done like that. Is that not the case?
Edit: I am modifying this model from huggingface: https://github.com/huggingface/transformers/blob/6e1ee47b361f9dc1b6da0104d77b38297042efae/src/transformers/models/mobilebert/modeling_mobilebert.py#L875
Edit 2: Sorry, I copied my code as-is. I am going to remove those comments from my first post, because they refer to the original huggingface masking.
Possibly :-). Maybe the name got me confused. The masking is either on the attention matrix (which for linear attention should be all ones) or per sample, which I consider to be the lengths of each sequence, namely how many keys there are for each sample (batch_size x sequence_length).
If the head mask is simply passed to key_lengths then it should be fine.
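For illustration, a hedged sketch of how a HuggingFace-style padding mask could be turned into the masks LinearAttention.forward expects (attn_mask, query_lengths, key_lengths); the example shapes and the query_dimensions constructor argument are assumptions and may differ between fast-transformers versions.

```python
import torch
from fast_transformers.masking import FullMask, LengthMask
from fast_transformers.attention.linear_attention import LinearAttention

N, L, H, E = 2, 128, 4, 64                      # example batch, seq_len, heads, head dim
q = k = v = torch.randn(N, L, H, E)
hf_attention_mask = torch.ones(N, L)            # HuggingFace-style 1/0 padding mask

attn_mask = FullMask(L, device=q.device)        # all ones: no N x N masking for linear attention
lengths = hf_attention_mask.sum(dim=1).long()   # number of real (non-padding) tokens per sample
length_mask = LengthMask(lengths, max_len=L, device=q.device)

attention = LinearAttention(query_dimensions=E)
out = attention(q, k, v, attn_mask,
                query_lengths=length_mask, key_lengths=length_mask)
```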
By the way, are you aware of any linear att. implementation in huggingface?
I definitely have something weird going on: I modified my linear attention with Fixup to remove LayerNorm. I need to check what is happening...
I think I have to fix Fixup xD. Now it is more consistent, but it is falling behind w.r.t. the original MobileBERT. I am using SELU instead of ELU; I will try a few other activations. Thanks for your support. Also, if you're aware of any other implementation to check, please let me know.
Closing as it is not an issue.
Best. Adrian
Edit: with CELU it seems much better.
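If it helps, the activation only enters through the feature map phi; here is a hedged sketch of drop-in alternatives for the linear_attention sketch earlier in the thread (the main constraint is that phi stays non-negative so the normalizer Z keeps its sign).

```python
import torch.nn.functional as F

def elu_feature_map(x):
    return F.elu(x) + 1             # default phi(x) = elu(x) + 1

def celu_feature_map(x, alpha=1.0):
    # alpha is the tunable knob; alpha = 1 reproduces ELU exactly
    return F.celu(x, alpha) + 1
```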
Good to know! :-)
Regarding falling behind, that could simply be due to the linear attention. Since the attention matrix is now low rank, learning is bound to be harder. The whole point is the performance/wall-clock-time tradeoff.
If your sequence is long then linear is going to be significantly faster, and since the performance difference at 20k steps is minuscule, linear is a better choice.
(I was replying when I saw your edit.) The fact that CELU works better could be interesting; however, if linear is not faster than softmax then there is little point in using it. So the bottom line is: unless you care about speed or memory, you are probably better off with softmax. If you do care (e.g. you are processing sequences 10k tokens long, or you only have an RTX 2060, or ...), then a small performance drop is to be expected. You can always increase the number of layers or heads to compensate.
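To make the tradeoff concrete, here is a rough back-of-the-envelope sketch with example sizes (not measurements from this thread):

```python
# Dominant intermediate each variant has to store, float32.
batch, heads, seq_len, head_dim = 1, 8, 10_000, 64

# Softmax attention materializes a seq_len x seq_len score matrix per head.
softmax_scores_gb = batch * heads * seq_len * seq_len * 4 / 1e9

# Linear attention's largest intermediate is the head_dim x head_dim KV summary,
# independent of sequence length.
linear_kv_mb = batch * heads * head_dim * head_dim * 4 / 1e6

print(f"softmax scores: ~{softmax_scores_gb:.1f} GB")   # ~3.2 GB
print(f"linear KV summary: ~{linear_kv_mb:.2f} MB")     # ~0.13 MB
```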
Let me know if I can help in any way.
Cheers, Angelos
Hello,
I have migrated your linear_attention.py to be compatible with huggingface. I have also modified the masking part to use a LengthMask.
The thing is that the model is very brittle and tends to diverge. It is very sensitive to hyper-parameters and initialization.
Do you have any tips and tricks for training with linear attention?
Thanks!
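As a starting point, here is a minimal hedged sketch of the two stabilizers suggested in the reply above (learning-rate warmup and gradient clipping); the model, data, and hyper-parameter values are placeholders, not taken from this thread.

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(128, 128)   # placeholder for the modified MobileBERT
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=20_000)

for step in range(20_000):
    loss = model(torch.randn(8, 128)).pow(2).mean()           # dummy loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```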