google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

How does the model reflect 'bidirectional'? #319

Open HaodaY opened 5 years ago

HaodaY commented 5 years ago

BERT stands for 'Bidirectional Encoder Representations from Transformers', so how does the Transformer reflect 'bidirectional', and why doesn't GPT?

xwzhong commented 5 years ago

here is some explanation: https://github.com/google-research/bert/issues/83

libertatis commented 5 years ago

You can find the description in the paper https://arxiv.org/abs/1810.04805: "We note that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation." So the BERT Transformer is the "Transformer encoder" with bidirectional self-attention, where every token can attend to context to its left and right, while the GPT Transformer is the "Transformer decoder" with constrained self-attention, where every token can only attend to context to its left.
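To make the contrast concrete, here is a toy NumPy sketch (not the repo's code; the names are mine) of the two attention patterns. The only structural difference is the mask that decides which positions a token may attend to.

```python
import numpy as np

seq_len = 5

# BERT-style bidirectional attention: every position may attend to every
# position, so a token sees context on both its left and its right.
bidirectional_mask = np.ones((seq_len, seq_len))

# GPT-style constrained (causal) attention: position i may only attend to
# positions <= i, i.e. context to its left.
left_only_mask = np.tril(np.ones((seq_len, seq_len)))

print(bidirectional_mask)
print(left_only_mask)
```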

JaneShenYY commented 5 years ago

@libertatis The self-attention in the Transformer computes the weighted value vector at a position t by multiplying the query at t with the key at every position, and so on. I think it attends to both the left and right words, so why do the BERT authors say it's uni-directional? And since the BERT paper claims to be bi-directional, which part does this job, and how is the bi-directionality implemented in the self-attention?

hsm207 commented 5 years ago

@JaneShenYY Yes, the self-attention layer attends to all the words to its left and right, and to itself too. This is the part of the network that makes it bidirectional.
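A toy single-head self-attention in NumPy (illustrative only; the learned Q/K/V projections and multi-head machinery are omitted) makes this visible: every query is scored against every key, and it is only an optional mask that hides the right-hand context.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, mask=None):
    """Toy single-head self-attention; x has shape (seq_len, d_model)."""
    d_model = x.shape[-1]
    scores = x @ x.T / np.sqrt(d_model)              # query at t vs. key at every position
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)   # hide disallowed positions
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ x                               # weighted sum of value vectors

x = np.random.randn(5, 8)
bert_like = self_attention(x)                                  # attends left and right
gpt_like = self_attention(x, mask=np.tril(np.ones((5, 5))))    # attends left only
```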

chikubee commented 5 years ago

@hsm207 If all the tokens can look at the left context, the right context, and themselves, why is it the [CLS] token that carries the sentence representation? How does it learn that representation? Some insights on the internal workings would help. Thanks in advance.

hsm207 commented 5 years ago

@chikubee The [CLS] token is prepended to every input sentence. So, at the first layer, the representation of [CLS] is a function of the [CLS] token itself and all the other tokens to its right. This pattern repeats until you reach the last transformer layer. I hope you can see that the [CLS] token has multiple opportunities to look at the input sentence's left and right context, since the token representations it depends on are themselves looking left and right. This means that the [CLS] token representation at the final layer can be considered a rich representation of the input sentence.

The [CLS] token carries the sentence representation in sentence classification tasks because this is the token whose representation is fine-tuned for the task at hand. We don't pick any other token as the sentence representation because the same token has a different representation depending on its location. For example, the representation of the word "the" in "the cat in the hat" is different from the one in "I like the cat". We also don't pick the n-th token as the representation because that wouldn't handle cases where the input sentence's length is less than n.

So, to make things easy for us, let's just tack on a dummy token (which we will call [CLS]) to every input sentence. This way, we can be sure that we always have a token whose representation is simply a function of the other tokens in the input sentence and not its position.
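For reference, this is roughly how the repo's run_classifier.py uses the [CLS] position: a task-specific head is fine-tuned on top of the final-layer [CLS] representation. A condensed sketch, assuming the modeling.BertModel and get_pooled_output() interface from this repository (dropout, loss computation and the exact variable names are omitted or simplified):

```python
import tensorflow as tf
import modeling  # modeling.py from this repository

def build_classifier(bert_config, input_ids, input_mask, segment_ids, num_labels):
    # The tokenizer has already prepended [CLS] to every example.
    model = modeling.BertModel(
        config=bert_config,
        is_training=False,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids)

    # get_pooled_output() is the final-layer hidden state of the first token
    # ([CLS]), passed through a dense + tanh "pooler" layer.
    cls_vector = model.get_pooled_output()        # [batch_size, hidden_size]

    # Illustrative classification head that gets fine-tuned on the task.
    logits = tf.layers.dense(cls_vector, num_labels)
    return tf.nn.softmax(logits, axis=-1)
```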

I hope this clarifies. Let me know if you have further questions.

chikubee commented 5 years ago

This was a really good explanation @hsm207, it clarifies a lot, thanks. What I still fail to understand is whether it is really a good representation of the sentence: when I try to check for similar sentences to interpret the false positives of a text classification task, the results tell otherwise in some cases. Can you share some insights on sentence similarity? Or is the correct way to go about it token-level cross computation, as in BERTScore?

Thanks again.
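For concreteness, "checking for similar sentences" with [CLS] vectors usually means something like cosine similarity over the final-layer [CLS] embeddings; a toy sketch with random placeholder vectors, not tied to any particular library:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these are final-layer [CLS] vectors for two sentences
# (hidden size 768 for BERT-Base).
cls_vec_a = np.random.randn(768)
cls_vec_b = np.random.randn(768)

print(cosine_similarity(cls_vec_a, cls_vec_b))
```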

hsm207 commented 5 years ago

@chikubee

Could you give some examples of your text classification use case and of how you are checking for similar sentences to interpret the false positives?

BERTScore is meant to evaluate the quality of machine-generated text, e.g. machine translation or image captioning. I don't see how this metric is applicable to text classification, since the outputs are class labels, not sequences of tokens.
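For completeness, a minimal usage sketch of BERTScore, assuming the bert-score package's score() interface (the sentences are made up): it compares a candidate against a reference token by token, which is why it fits generation tasks rather than label prediction.

```python
# pip install bert-score
from bert_score import score

candidates = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

# Precision/recall/F1 come from matching contextual token embeddings
# between candidate and reference.
P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())
```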

wj-Mcat commented 3 years ago

@hsm207 It's a great explanation of why [CLS] learns the sentence representation, thanks a lot.