huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[Causal Language Modeling] seems not as expected #10564

Closed voidful closed 3 years ago

voidful commented 3 years ago

Problem

A causal model attends only to the left context, so it should not depend on the tokens to its right. For example, in GPT-2 the representation of "Ich" should be unchanged no matter what appears to its right, since a causal language model uses uni-directional self-attention.

from transformers import AutoModel, AutoTokenizer, AutoConfig
import torch

# GPT-2 (decoder-only, causal self-attention)
gpt_model = AutoModel.from_pretrained('gpt2')
gpt_tokenizer = AutoTokenizer.from_pretrained('gpt2')
embeddings = gpt_model.get_input_embeddings()

# create ids of encoded input vectors
decoder_input_ids = gpt_tokenizer("<pad> Ich will ein", return_tensors="pt", add_special_tokens=False).input_ids

# pass the input_ids through GPT-2 and take the output hidden states
lm_logits = gpt_model(decoder_input_ids).last_hidden_state

# change the input slightly: replace the last token ("ein" -> "das")
decoder_input_ids_perturbed = gpt_tokenizer("<pad> Ich will das", return_tensors="pt", add_special_tokens=False).input_ids
lm_logits_perturbed = gpt_model(decoder_input_ids_perturbed).last_hidden_state

# compare the hidden state of `Ich` for the original and perturbed input_ids
print("Is encoding for `Ich` equal to its perturbed version?: ", torch.allclose(lm_logits[0, 0], lm_logits_perturbed[0, 0], atol=1e-3))

Result

Is encoding for `Ich` equal to its perturbed version?:  True

However, for other models the result does not follow this assumption: the hidden states change when the right-side input changes. What is the reason? Is it a bug? I really want to know the answer, thank you!

BERT

Is encoding for `Ich` equal to its perturbed version?:  False

BART

Is encoding for `Ich` equal to its perturbed version?:  False

RoBERTa

Is encoding for `Ich` equal to its perturbed version?:  False

Experiment notebook colab
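
For readers without the notebook open, here is a rough sketch of what the equivalent check for one of these models could look like. The exact construction in the notebook may differ; building a RoBERTa causal-LM head from a config with is_decoder=True is an assumption based on the discussion further down the thread.

from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM
import torch

# assumed setup: a RoBERTa causal-LM head built from a config (random weights),
# with the causal attention mask enabled via is_decoder=True
config = AutoConfig.from_pretrained("roberta-base")
config.is_decoder = True
roberta_lm = AutoModelForCausalLM.from_config(config)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
input_ids = tokenizer("<pad> Ich will ein", return_tensors="pt", add_special_tokens=False).input_ids
input_ids_perturbed = tokenizer("<pad> Ich will das", return_tensors="pt", add_special_tokens=False).input_ids

# compare the output for the first token with and without the perturbation on the right
logits = roberta_lm(input_ids).logits
logits_perturbed = roberta_lm(input_ids_perturbed).logits
print("Is encoding for `Ich` equal to its perturbed version?: ", torch.allclose(logits[0, 0], logits_perturbed[0, 0], atol=1e-3))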

Environment info

Who can help

Information

Model I am using (GPT, BERT, RoBERTa, BART ForCausalLM):

The problem arises when using:

To reproduce

Experiment notebook colab

Expected behavior

Causal models should not be affected by the right context.

tomohideshibata commented 3 years ago

This is not a problem. When the model predicts the word next to "Ich" (given "Ich"), the word "Ich" cannot attend to the words in the future positions (e.g., "will", "ein", etc.). However, when the model predicts the word next to "ein" (given "Ich will ein"), the word "Ich" can attend to "will" and "ein", which is not cheating. So, the word embeddings of "Ich" in the different right contexts are different.

voidful commented 3 years ago

This is not a problem. When the model predicts the word next to "Ich" (given "Ich"), the word "Ich" cannot attend to the words in the future positions (e.g., "will", "ein", etc.). However, when the model predicts the word next to "ein" (given "Ich will ein"), the word "Ich" can attend to "will" and "ein", which is not cheating. So, the word embeddings of "Ich" in the different right contexts are different.

I agree this is true for transformer encoder models, but for decoder models, due to the causal mask, the left context should not be affected by the right context. That's why the GPT hidden state for "Ich" does not change.

Therefore, I am curious why the CausalLM models do not follow this rule.

p208p2002 commented 3 years ago

This is not a problem. When the model predicts the word next to "Ich" (given "Ich"), the word "Ich" cannot attend to the words in the future positions (e.g., "will", "ein", etc.). However, when the model predicts the word next to "ein" (given "Ich will ein"), the word "Ich" can attend to "will" and "ein", which is not cheating. So, the word embeddings of "Ich" in the different right contexts are different.

I think the hidden states of the previous tokens should not change; if they did change, there would be no way to compute the loss for all tokens at once in a CausalLM.
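
As context for this point: standard causal-LM training relies on scoring every position in a single teacher-forced forward pass, which only works because earlier hidden states cannot see later tokens. A minimal sketch with GPT-2 (the example sentence is arbitrary):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Ich will ein Eis", return_tensors="pt").input_ids

# one forward pass scores every position at once; the logits at position i are
# compared against the token at position i+1 (the labels are shifted internally)
outputs = model(input_ids, labels=input_ids)
print(outputs.loss)          # average next-token loss over all positions
print(outputs.logits.shape)  # (batch, sequence_length, vocab_size)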

tomohideshibata commented 3 years ago

I was talking about decoder, not encoder. The attention masks vary according to a decoding step.

(In the following, "->" means "attends to".) When the model predicts the next word given "Ich":

  • "Ich" -> None

When the model predicts the next word given "Ich will ein":

  • "Ich" -> "will" and "ein"
  • "will" -> "Ich" and "ein"
  • "ein" -> "Ich" and "will"

Please see the "The Illustrated Masked Self-Attention" section in the following page. https://jalammar.github.io/illustrated-gpt2/
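
One way to check which positions actually attend to which in GPT-2 is to print the attention weights; a small sketch (not from the notebook):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

input_ids = tokenizer("Ich will ein", return_tensors="pt").input_ids
outputs = model(input_ids, output_attentions=True)

# outputs.attentions holds one tensor per layer, shape (batch, heads, seq_len, seq_len)
attn = outputs.attentions[0][0, 0]  # first layer, first head
print(attn)
# the matrix is lower-triangular: row i (the query at position i) has zero
# weight on columns j > i, i.e. no position attends to tokens on its right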

voidful commented 3 years ago

I was talking about decoder, not encoder. The attention masks vary according to a decoding step.

(In the following, "->" means "attends to") When the model predicts the next word given "Ich":

  • "Ich" -> None

When the model predicts the next word given "Ich will ein":

  • "Ich" -> "will" and "ein"
  • "will" -> "Ich" and "ein"
  • "ein" -> "Ich" and "will"

Please see the "The Illustrated Masked Self-Attention" section in the following page. https://jalammar.github.io/illustrated-gpt2/

https://huggingface.co/blog/encoder-decoder#decoder

auto-regressive models, such as GPT2, have the same architecture as transformer-based decoder models if one removes the cross-attention layer

On a side-note, autoencoding models, such as Bert, have the same architecture as transformer-based encoder models.

So, leaving cross-attention aside, the main difference between a transformer encoder and a transformer decoder is that the encoder uses bi-directional self-attention, while the decoder uses uni-directional self-attention instead.

"Ich" is attended to by "will", but that attention contributes to the hidden state of the "will" token, not to the hidden state of the "Ich" token.
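
To make the uni-directional vs. bi-directional distinction concrete, here is a toy single-head self-attention sketch in plain PyTorch (not the transformers implementation): with a causal mask, perturbing the last token leaves the first position's output untouched; without it, the first position changes as well.

import torch

def self_attention(x, causal=False):
    # single-head scaled dot-product self-attention over x of shape (seq_len, dim),
    # using x itself as queries, keys and values
    scores = x @ x.transpose(0, 1) / x.shape[-1] ** 0.5
    if causal:
        # mask out everything to the right of each query position (upper triangle)
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
x = torch.randn(4, 8)
x_perturbed = x.clone()
x_perturbed[-1] += 1.0  # change only the rightmost token

# bi-directional (encoder-style): the first position is affected by the change
print(torch.allclose(self_attention(x)[0], self_attention(x_perturbed)[0]))
# uni-directional (decoder-style): the first position only sees itself, so it is unchanged
print(torch.allclose(self_attention(x, causal=True)[0], self_attention(x_perturbed, causal=True)[0]))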

voidful commented 3 years ago

All the theory is right. I found the reason: it is because of the bias...

In the from_pretrained function, model.eval() is called by default, which disables all the bias in the model. https://github.com/huggingface/transformers/blob/88a951e3cc00f56b94d9b93dbc35a3812cd88747/src/transformers/modeling_utils.py#L1190

However, from_config does not call model.eval() by default, so the result is affected by the bias. https://github.com/huggingface/transformers/blob/d26b37e744ea980977e266adf48736451b73c583/src/transformers/models/auto/modeling_auto.py#L750

Therefore, I suggest that we call model.eval() in from_config, the same as in from_pretrained.
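
A quick sketch (not the notebook code) of how the from_config vs. from_pretrained difference can be checked, using a GPT-2 causal LM built from a config:

from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM
import torch

config = AutoConfig.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_config(config)  # freshly initialised, still in training mode
tokenizer = AutoTokenizer.from_pretrained("gpt2")

ids = tokenizer("<pad> Ich will ein", return_tensors="pt", add_special_tokens=False).input_ids
ids_perturbed = tokenizer("<pad> Ich will das", return_tensors="pt", add_special_tokens=False).input_ids

def first_token_matches(m):
    return torch.allclose(m(ids).logits[0, 0], m(ids_perturbed).logits[0, 0], atol=1e-3)

# in training mode the comparison typically prints False, even though the
# attention mask is causal, because training-mode behaviour (dropout) is active
print(model.training, first_token_matches(model))

model.eval()  # what from_pretrained already does
# in eval mode the first position no longer depends on the right context
print(model.training, first_token_matches(model))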

patrickvonplaten commented 3 years ago

model.eval() does not disable the bias in the model as far as I know. model.eval() simply puts the model into "non-training" mode, meaning that dropout layers are not applied, etc. I don't think we need to add a model.eval() to the from_config() function.
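
For illustration, a minimal sketch (assuming the pretrained GPT-2 from the first snippet) showing that train()/eval() only toggles training-mode behaviour such as dropout:

from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained("gpt2")  # from_pretrained already puts the model in eval mode
tokenizer = AutoTokenizer.from_pretrained("gpt2")
ids = tokenizer("Ich will ein", return_tensors="pt").input_ids

# eval mode: dropout is disabled, so two identical forward passes agree exactly
print(torch.allclose(model(ids).last_hidden_state, model(ids).last_hidden_state))

model.train()  # re-enable training-mode behaviour (dropout)
# training mode: each forward pass samples a different dropout mask, so even
# the same input typically gives different hidden states
print(torch.allclose(model(ids).last_hidden_state, model(ids).last_hidden_state))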

voidful commented 3 years ago

model.eval() does not disable the bias in the model as far as I know. model.eval() simply puts the model into "non-training" mode, meaning that dropout layers are not applied, etc. I don't think we need to add a model.eval() to the from_config() function.

I don't know why I said bias 😂. It should be dropout.

from_config() is more likely to be used for training, so it should be fine not to add model.eval() by default.

Thanks for your reply~