google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Question: What does "pooler layer" mean? Why is it called "pooler"? #1102

Open miyamonz opened 4 years ago

miyamonz commented 4 years ago

This question is just about the term "pooler", and maybe more of an English question than a question about BERT.

By reading this repository and its issues, I found that the "pooler layer" is placed after the stack of Transformer encoders, and that it changes depending on the training task. But I can't understand why it is called "pooler".

I googled the words "pooler" and "pooler layer", and it seems that this is not standard ML terminology.

By the way, the pooling layer that appears in CNNs sounds similar, but it seems to be a different thing.

ameet-1997 commented 4 years ago

I agree that the name "pooler" might be a little confusing. The BERT model can be divided into three parts, which makes it easier to understand (the sketch after the list shows how these map onto this repo's modeling.py):

  1. Embedding layer: maps the one-hot encodings of the input tokens to dense embeddings
  2. Encoder: the Transformer stack with self-attention heads
  3. Pooler: takes the output representation corresponding to the first token and transforms it for use in downstream tasks
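
For concreteness, here is roughly how the three parts surface in this repo's modeling.py (the BertModel accessors are real; the placeholder input and config values are just for illustration):

```python
import tensorflow as tf  # TF 1.x, as used by this repo
import modeling          # this repo's modeling.py

# Placeholder input; vocab_size=30522 matches the released uncased models.
input_ids = tf.placeholder(tf.int32, shape=[None, 128])

model = modeling.BertModel(
    config=modeling.BertConfig(vocab_size=30522),
    is_training=False,
    input_ids=input_ids)

embeddings = model.get_embedding_output()  # 1. embedding layer, [B, T, H]
encoded = model.get_sequence_output()      # 2. encoder output,  [B, T, H]
pooled = model.get_pooled_output()         # 3. pooler output,   [B, H]
```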

In the BERT paper, after passing a sentence through the model, the output representation corresponding to the first token is used for fine-tuning on tasks like SQuAD and GLUE. The pooler layer does precisely that: it applies a linear transformation over the representation of the first token. The linear transformation is trained with the Next Sentence Prediction (NSP) objective.
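
Concretely, the pooler in modeling.py boils down to roughly the following (a paraphrased sketch; the function wrapper is mine, and note that modeling.py also applies a tanh after the linear transformation):

```python
import tensorflow as tf  # TF 1.x

def pooler(sequence_output, hidden_size=768):
    # sequence_output: [batch_size, seq_length, hidden_size]
    # Take the hidden state of the first token only (the [CLS] position).
    first_token_tensor = tf.squeeze(sequence_output[:, 0:1, :], axis=1)
    # A trained dense layer; modeling.py uses a tanh activation here.
    return tf.layers.dense(
        first_token_tensor,
        hidden_size,
        activation=tf.tanh,
        kernel_initializer=tf.truncated_normal_initializer(stddev=0.02))
```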

secsilm commented 4 years ago

I think it's OK to call it a "pooler" layer.

This layer transforms the output of the Transformer from shape [batch_size, seq_length, hidden_size] to [batch_size, hidden_size]. This is similar to GlobalMaxPool1D, except that instead of taking the max over the sequence, it just takes the first token's vector directly.

So, functionally speaking, this is a kind of pooling.
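
A toy NumPy comparison of the two reductions (shapes only; the trained dense layer and tanh that the real pooler adds afterwards are omitted):

```python
import numpy as np

x = np.random.randn(8, 128, 768)   # [batch_size, seq_length, hidden_size]

max_pooled = x.max(axis=1)         # what GlobalMaxPool1D would do
first_token = x[:, 0, :]           # what BERT's pooler starts from instead

print(max_pooled.shape, first_token.shape)  # (8, 768) (8, 768)
```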

guoxuxu commented 4 years ago

> The linear transformation is trained with the Next Sentence Prediction (NSP) objective.

Hi, I have a question about this NSP objective. Since the pooler is used for downstream tasks like sentence classification, is it helpful to use a pooler that was trained for predicting the next sentence? The task now is to predict a label...

Thanks

amandalmia14 commented 3 years ago

@secsilm I understand that it might be doing some kind of GlobalMaxPool1D. However, do you know what exact algorithm they use to reduce the dimension? I am afraid they are using "max", as GlobalMaxPool1D does.

Thanks

secsilm commented 3 years ago

> @secsilm I understand that it might be doing some kind of GlobalMaxPool1D. However, do you know what exact algorithm they use to reduce the dimension? I am afraid they are using "max", as GlobalMaxPool1D does.
>
> Thanks

Not max. They just use the vector of the first token to represent the whole sequence.

MonliH commented 1 year ago

Correct. For most tasks, the first token is a special token (such as [CLS] for classification tasks). That is exactly why tokens like [CLS] exist.
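
For example, run_classifier.py assembles its input sequences roughly like this (a simplified sketch; the example word pieces are made up), so index 0, the position the pooler reads from, is always [CLS]:

```python
# Simplified from how run_classifier.py builds a sequence pair:
tokens_a = ["my", "dog", "is", "cute"]        # word pieces of segment A (example)
tokens_b = ["he", "likes", "play", "##ing"]   # word pieces of segment B (example)

tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
print(tokens[0])  # [CLS] -- the token whose output vector the pooler uses
```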