This is not a problem. When the model predicts the word next to "Ich" (given only "Ich"), the word "Ich" cannot attend to the words at future positions (e.g., "will", "ein", etc.). However, when the model predicts the word next to "ein" (given "Ich will ein"), the word "Ich" can attend to "will" and "ein", which is not cheating. So the word embeddings of "Ich" under different right contexts are different.
I agree this is true for transformer encoder models, but for decoder models, due to the causal mask, the left context should not be affected by the right context. That's why the hidden state of "Ich" in GPT will not change.
Therefore, I am curious why the CausalLM models do not follow this rule.
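A quick sanity check of this claim (a minimal sketch; it assumes the public gpt2 checkpoint, which from_pretrained already puts in eval mode so dropout is off):

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")  # from_pretrained() calls eval() internally

with torch.no_grad():
    h_short = model(**tokenizer("Ich", return_tensors="pt")).last_hidden_state
    h_long = model(**tokenizer("Ich will ein", return_tensors="pt")).last_hidden_state

# Because of the causal mask, the hidden state at the first position should be the
# same whether or not tokens follow it.
print(torch.allclose(h_short[0, 0], h_long[0, 0], atol=1e-5))
```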
I think the hidden state of a previous token should not change; if it did, there would be no way to compute the loss for all tokens in a single pass in a CausalLM.
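For context, this is what that single pass looks like; a small sketch with the standard gpt2 checkpoint, where every position is scored against its next token in one call:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("Ich will ein Eis", return_tensors="pt")
# Passing labels=input_ids makes GPT2LMHeadModel shift the labels internally and
# return the cross-entropy loss over every next-token prediction at once.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)
```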
I was talking about the decoder, not the encoder. The attention masks vary according to the decoding step.
(In the following, "->" means "attends to".)
When the model predicts the next word given "Ich":
- "Ich" -> None
When the model predicts the next word given "Ich will ein":
- "Ich" -> "will" and "ein"
- "will" -> "Ich" and "ein"
- "ein" -> "Ich" and "will"
Please see the "The Illustrated Masked Self-Attention" section in the following page. https://jalammar.github.io/illustrated-gpt2/
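For reference, the masking shown in that section boils down to a lower-triangular mask over the attention scores; a toy sketch (made-up scores, not real GPT-2 values):

```python
import torch

# Toy attention scores for a 3-token prefix ("Ich", "will", "ein").
scores = torch.randn(3, 3)

# Lower-triangular mask: query position i may only attend to key positions <= i.
causal_mask = torch.tril(torch.ones(3, 3)).bool()
weights = torch.softmax(scores.masked_fill(~causal_mask, float("-inf")), dim=-1)

# Row 0 ("Ich") puts all of its weight on itself; row 2 ("ein") can see all three tokens.
print(weights)
```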
https://huggingface.co/blog/encoder-decoder#decoder
auto-regressive models, such as GPT2, have the same architecture as transformer-based decoder models if one removes the cross-attention layer
On a side-note, autoencoding models, such as Bert, have the same architecture as transformer-based encoder models.
So, without involving cross-attention, the main difference between the transformer encoder and decoder is that the encoder uses bi-directional self-attention, while the decoder uses a uni-directional self-attention layer instead.
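That difference can be checked empirically; a rough sketch using the public gpt2 and bert-base-uncased checkpoints (from_pretrained puts both in eval mode, so dropout does not interfere):

```python
import torch
from transformers import AutoModel, AutoTokenizer

def first_position_unchanged(name: str) -> bool:
    """Return True if the first token's hidden state ignores the right context."""
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    with torch.no_grad():
        h_short = model(**tokenizer("Ich", return_tensors="pt")).last_hidden_state
        h_long = model(**tokenizer("Ich will ein", return_tensors="pt")).last_hidden_state
    return torch.allclose(h_short[0, 0], h_long[0, 0], atol=1e-5)

print("gpt2:", first_position_unchanged("gpt2"))                        # uni-directional -> True
print("bert-base-uncased:", first_position_unchanged("bert-base-uncased"))  # bi-directional -> False
```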
"Ich" is attended to by "will", but that attention contributes to the hidden state of the "will" token, not to the hidden state of the "Ich" token.
All the theory is right. I found the reason: it is because of the bias...
In the from_pretrained function, model.eval() is called by default, which disables all the bias in the model.
https://github.com/huggingface/transformers/blob/88a951e3cc00f56b94d9b93dbc35a3812cd88747/src/transformers/modeling_utils.py#L1190
However, from_config does not call model.eval() by default, so the result is affected by the bias.
https://github.com/huggingface/transformers/blob/d26b37e744ea980977e266adf48736451b73c583/src/transformers/models/auto/modeling_auto.py#L750
Therefore, I suggest that we call model.eval() in from_config as well, the same as in from_pretrained.
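The difference is easy to confirm by looking at the module's training flag (a small sketch; gpt2 and GPT2Config are just convenient stand-ins):

```python
from transformers import AutoModelForCausalLM, GPT2Config

pretrained = AutoModelForCausalLM.from_pretrained("gpt2")  # eval() is called internally
scratch = AutoModelForCausalLM.from_config(GPT2Config())   # no eval() call

print(pretrained.training)  # False -> dropout layers are inactive
print(scratch.training)     # True  -> dropout layers are still active
```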
model.eval() does not disable the bias in the model as far as I know. model.eval() simply puts the model into "non-training" mode, meaning that dropout layers are not applied, etc. I don't think we need to add a model.eval() call to the from_config() function.
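A small sketch of that point: with a from_config model, the same input run twice gives different logits while dropout is active, and becomes deterministic after model.eval(), even though no weights or biases are touched:

```python
import torch
from transformers import AutoModelForCausalLM, GPT2Config, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_config(GPT2Config())  # randomly initialised, training mode
inputs = tokenizer("Ich will ein", return_tensors="pt")

with torch.no_grad():
    same_before = torch.allclose(model(**inputs).logits, model(**inputs).logits)

model.eval()  # turns off dropout; weights and biases are untouched
with torch.no_grad():
    same_after = torch.allclose(model(**inputs).logits, model(**inputs).logits)

print(same_before, same_after)  # typically: False True
```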
I don't know why I said bias 😂, it should be dropout.
from_config() is more likely to be used for training, so it should be fine not to call model.eval() by default.
Thanks for your reply~
Problem
Causal models attend only to the left context, so they should not depend on the tokens to their right. For example, in GPT-2 the word embedding of "I" will be unchanged no matter what is on its right, since causal language models use uni-directional self-attention.
Result
However, for other models the result does not follow this assumption: the logits change when the right-side input changes. What is the reason? Is it a bug? I really want to know the answer, thank you!
BERT
BART
Roberta
Experiment notebook colab
Environment info
transformers version: 4.3.3
Who can help
Information
Model I am using (GPT, Bert, RoBerta, BART ForCausalLM):
The problem arises when using:
To reproduce
Experiment notebook colab
Expected behavior
Causal models should not be affected by the right context.
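For concreteness, this is the expected behavior sketched with the pretrained gpt2 checkpoint (from_pretrained already puts it in eval mode):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

with torch.no_grad():
    short = model(**tokenizer("Ich will", return_tensors="pt")).logits
    long = model(**tokenizer("Ich will ein", return_tensors="pt")).logits

# The logits at the shared prefix positions should not depend on " ein".
n = short.shape[1]
print(torch.allclose(short[0, :n], long[0, :n], atol=1e-4))  # expected: True
```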