mlpen / Nystromformer


Does it support causal mask for GPT2-esque models? #7

Open · miguelvictor opened this issue 3 years ago

miguelvictor commented 3 years ago

For models with only decoder stacks, how do we apply a causal mask?

yyxiongzju commented 3 years ago

@miguelvictor, I looked at the causal self-attention from mingpt:

```python
# causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
```

It looks like the causal (lower-triangular) mask is applied to an O(T^2) attention matrix, so this model would need some changes to support a causal mask. Are you asking about applying Nystromformer to the causal self-attention above?

thomasw21 commented 3 years ago

So I have thought about it a bit, and it turns out causal masking can be computed (more or less).

If you apply a trick similar to the one described in the Performer paper (appendix B.1), you should be able to get causal masking. Basically, you replace Q with the product of the first two matrices in the Nystrom approximation.
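
Roughly, the prefix-sum version would look something like this (just a sketch: `q_prime` stands for the product of the first two Nystrom factors and `k_prime` for the transpose of the third; the names and shapes are illustrative, and the softmax normalizations and the landmark computation are ignored here):

```python
import torch

def causal_prefix_sum_attention(q_prime, k_prime, v):
    """Causal attention for a factorized (linear) attention form.

    q_prime: (B, T, m)  -- e.g. product of the first two Nystrom factors
    k_prime: (B, T, m)  -- e.g. transpose of the third Nystrom factor
    v:       (B, T, d)
    Returns: (B, T, d)
    """
    # Running sum of the outer products k'_j v_j^T, so position i only uses j <= i.
    kv = torch.einsum('btm,btd->btmd', k_prime, v)      # (B, T, m, d)
    kv = kv.cumsum(dim=1)                               # prefix sums over time
    out = torch.einsum('btm,btmd->btd', q_prime, kv)    # (B, T, d)
    return out
```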

However there's another issue: your landmark selection can't really be causal (at least in the current setup).

So if you can figure out some landmark selection algorithm that doesn't depend on the input sequence, you might be able to obtain a causal mask. You could obtain such a thing by making landmark selections a trainable parameter for example.

All of this is in search of a causal masking that respects linear complexity.

yyxiongzju commented 3 years ago

@thomasw21, the trick can be applied here for causal masking, but I am not sure if it will work out.

For the landmarks, we do not have to use the whole input sequence to compute them, but they do depend on the input sequence. I thought about using standard self-attention for the first k tokens, and then using those first k tokens to compute the landmarks. I am not sure whether a reasonable k will work out; this needs experimental results to support it.
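
Something like this, just to illustrate the landmark part (a rough sketch only; the segment-mean choice and the assumption that k is divisible by m are purely for illustration):

```python
import torch

def landmarks_from_prefix(x, k, m):
    """Compute m landmarks from only the first k tokens of x (B, T, d),
    e.g. by segment-means over the prefix, so later tokens never leak in."""
    B, T, d = x.shape
    assert k % m == 0, "this sketch assumes k is divisible by m"
    return x[:, :k, :].reshape(B, m, k // m, d).mean(dim=2)   # (B, m, d)
```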

Adding trainable parameters to learn landmark selection is a reasonable approach. While it introduces an additional (linear) number of parameters, the landmarks (trainable parameters) would be fixed for the whole input dataset, which may lead to unfavorable performance. This also needs experimental results to verify.

It will be interesting to see some results from applying the above schemes.

thomasw21 commented 3 years ago

> For the landmarks, we do not have to use the whole input sequence to compute them, but they do depend on the input sequence. I thought about using standard self-attention for the first k tokens, and then using those first k tokens to compute the landmarks. I am not sure whether a reasonable k will work out; this needs experimental results to support it.

I see two small issues here:

> Adding trainable parameters to learn landmark selection is a reasonable approach. While it introduces an additional (linear) number of parameters, the landmarks (trainable parameters) would be fixed for the whole input dataset, which may lead to unfavorable performance. This also needs experimental results to verify.

Yes, I'd agree here. It's unclear to me whether it would work. Yet if you look at Perceiver, for example, they do have these trainable parameters in order to linearise their transformer.
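
Concretely, I imagine something like this (a sketch only, not Perceiver's or Nystromformer's actual code; the class name and scaling are just illustrative):

```python
import torch
from torch import nn

class LearnedLandmarks(nn.Module):
    """Perceiver-style alternative: m landmark vectors learned as parameters,
    independent of the input sequence, so they cannot leak future tokens."""
    def __init__(self, m, d):
        super().__init__()
        self.landmarks = nn.Parameter(torch.randn(m, d) / d ** 0.5)

    def forward(self, batch_size):
        # Broadcast the same learned landmarks to every example in the batch.
        return self.landmarks.unsqueeze(0).expand(batch_size, -1, -1)  # (B, m, d)
```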

> It will be interesting to see some results from applying the above schemes.

+1

yyxiongzju commented 3 years ago

@thomasw21, the above scheme does have those two issues. 1) For a seq2seq model, the above scheme does not fit; I describe a new way to handle it in the response below, which may be more suitable. 2) There is no guarantee that the first k tokens carry enough information to represent a given sequence, and I am not sure how large k has to be for it to work.

yyxiongzju commented 3 years ago

@miguelvictor, @thomasw21, one suitable way to apply Nystromformer to GPT-2 and to seq2seq models can be described as follows. For GPT-2, the attention is softmax(QK^T + M)V, where M is a mask matrix; after applying M, softmax(QK^T + M) is a lower-triangular matrix. In order to apply Nystromformer, we need to change the way M is designed. In this case, we define 3 masks in Nystromformer, M_l, M_m, and M_r: M_l makes the left n x m matrix softmax(Q\tilde{K}^T + M_l) lower triangular, M_m makes the middle m x m matrix softmax(\tilde{Q}\tilde{K}^T + M_m) lower triangular, and M_r makes the right m x n matrix softmax(\tilde{Q}K^T + M_r) lower triangular. Since the lower-triangular matrix softmax(\tilde{Q}\tilde{K}^T + M_m) is invertible, its inverse is also lower triangular. All matrices in the Nystrom approximation are then lower triangular, so the output attention matrix from Nystromformer is lower triangular as well. Therefore Nystromformer can be used here with a linear cost.
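
A rough sketch of the three masks (purely illustrative: the landmark positions, the direct inverse, and the handling of tokens before the first landmark are all simplifications, not the actual implementation):

```python
import math
import torch

def causal_nystrom_attention(q, k, v, landmark_pos):
    """Sketch of the three-mask idea: make every Nystrom factor lower triangular.

    q, k, v:      (B, T, d)
    landmark_pos: (m,) LongTensor of increasing token positions assigned to the
                  m landmarks (how to pick them causally is the open question).
    """
    B, T, d = q.shape
    scale = 1.0 / math.sqrt(d)

    # One simple (illustrative) landmark choice: take q/k at the chosen positions.
    q_tilde = q[:, landmark_pos, :]                      # (B, m, d)
    k_tilde = k[:, landmark_pos, :]                      # (B, m, d)

    pos = torch.arange(T, device=q.device)
    # M_l: token i may only attend to landmarks at positions <= i          -> (T, m)
    mask_l = landmark_pos[None, :] <= pos[:, None]
    # M_m: landmark i may only attend to landmarks at positions <= its own -> (m, m)
    mask_m = landmark_pos[None, :] <= landmark_pos[:, None]
    # M_r: landmark i may only attend to tokens at positions <= its own    -> (m, T)
    mask_r = pos[None, :] <= landmark_pos[:, None]

    def masked_softmax(scores, keep):
        return scores.masked_fill(~keep, float('-inf')).softmax(dim=-1)

    f1 = masked_softmax(q @ k_tilde.transpose(-2, -1) * scale, mask_l)        # (B, T, m)
    f2 = masked_softmax(q_tilde @ k_tilde.transpose(-2, -1) * scale, mask_m)  # (B, m, m)
    f3 = masked_softmax(q_tilde @ k.transpose(-2, -1) * scale, mask_r)        # (B, m, T)

    # f2 is lower triangular with a positive diagonal, hence invertible; the repo
    # uses an iterative pseudo-inverse, here we just invert directly.
    # Note: tokens before the first landmark position get an all-masked row in f1
    # (NaNs after softmax); a real implementation would handle those early tokens
    # separately, e.g. with exact attention or by placing a landmark at position 0.
    return f1 @ torch.linalg.inv(f2) @ (f3 @ v)           # (B, T, d)
```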

For seq2seq, assume the first sequence has length n_1 and the second sequence has length n_2. The mask makes the (n_1 + n_2) x (n_1 + n_2) softmax matrix look like (S_1, 0; S_2, S_3), where S_1 \in R^{n_1 x n_1}, S_2 \in R^{n_2 x n_1}, and S_3 \in R^{n_2 x n_2} is lower triangular. To make Nystromformer work here, we apply Nystromformer to compute S_1 using the first sequence, S_2 using the first and the second sequence, and the lower-triangular S_3 using the second sequence with a mask similar to the GPT-2 case above. In this way, we can fit it here. Note that we do not need to form the (n_1 + n_2) x (n_1 + n_2) softmax matrix and then multiply it with V; we can use the same trick as in Nystromformer for the matrix multiplication to compute the output. In this way, it achieves linear-cost self-attention.
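
Just to visualize the block structure of the mask (this sketch materializes the full matrix only for illustration; as noted above, we would never actually form it):

```python
import torch

def seq2seq_keep_mask(n1, n2):
    """Boolean keep-mask with the (S_1, 0; S_2, S_3) block structure above.

    True means "may attend". Rows/columns 0..n1-1 are the first sequence,
    rows/columns n1..n1+n2-1 are the second sequence.
    """
    n = n1 + n2
    keep = torch.zeros(n, n, dtype=torch.bool)
    keep[:n1, :n1] = True                                    # S_1: first sequence, full self-attention
    keep[n1:, :n1] = True                                    # S_2: second sequence attends to the first
    keep[n1:, n1:] = torch.tril(torch.ones(n2, n2)).bool()   # S_3: causal within the second sequence
    return keep
```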

yuzhenmao commented 1 year ago

Hi Yunyang,

Thank you very much for this amazing work! I am still unclear about the causal mask implementation in Nystromformer. I think that even if you use causal masks when computing kernels 1, 2, and 3 in https://github.com/mlpen/Nystromformer/blob/main/LRA/code/attention_nystrom.py#L38, the causal condition is still violated when you compute the landmarks. For now, I don't have a good idea for designing a causal mask for Nystromformer.

Please correct me if I am wrong. Thank you again!