miyamonz opened this issue 4 years ago
I agree that the name pooler might be a little confusing. The BERT model can be divided into three parts to make it easier to understand: the embeddings, the Transformer encoder, and the pooler.
In the BERT paper, after a sentence is passed through the model, the representation corresponding to the first token of the output is used for fine-tuning on tasks like SQuAD and GLUE. The pooler layer does precisely that: it applies a linear transformation over the representation of the first token. The linear transformation is trained during pre-training with the Next Sentence Prediction (NSP) objective.
I think it's fine to call it the "pooler" layer.
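In code, the pooler amounts to something like the minimal PyTorch sketch below. The class and variable names are just illustrative; in the reference implementations the dense layer is followed by a tanh activation.

```python
import torch
import torch.nn as nn

class Pooler(nn.Module):
    """Minimal BERT-style pooler: a dense layer + tanh applied to the
    hidden state of the first token of the sequence."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.Tanh()

    def forward(self, sequence_output: torch.Tensor) -> torch.Tensor:
        # sequence_output: [batch_size, seq_length, hidden_size]
        first_token = sequence_output[:, 0]  # hidden state of the first ([CLS]) token
        return self.activation(self.dense(first_token))  # [batch_size, hidden_size]
```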
This layer transforms the output of the Transformer from shape [batch_size, seq_length, hidden_size] to [batch_size, hidden_size]. It is similar to GlobalMaxPool1D, except that instead of max-pooling it simply takes the first token's vector directly. So, functionally speaking, it is a form of pooling.
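To make the shape change concrete, here is a small sketch comparing max pooling over the sequence dimension with BERT's first-token "pooling" (the tensor sizes here are made up):

```python
import torch

batch_size, seq_length, hidden_size = 8, 128, 768
sequence_output = torch.randn(batch_size, seq_length, hidden_size)

# GlobalMaxPool1D-style reduction: max over the sequence dimension.
max_pooled = sequence_output.max(dim=1).values   # [8, 768]

# BERT-style "pooling": slice out the first token's vector directly.
first_token = sequence_output[:, 0, :]           # [8, 768]

print(max_pooled.shape, first_token.shape)       # both: torch.Size([8, 768])
```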
The linear transformation is trained during pre-training with the Next Sentence Prediction (NSP) objective.
Hi, I have a question about this NSP objective. Since the pooler is used for downstream tasks like sentence classification, is it helpful to use a pooler that was trained to predict the next sentence? The task now is to predict a label...
Thanks
@secsilm I understand it might be doing some kind of GlobalMaxPool1D, but do you know exactly which operation they use to reduce the dimension? I am afraid they are using "max", as in GlobalMaxPool1D.
Thanks
Not max. They just use the vector of the first token to represent the whole sequence.
Correct. For most tasks, the first token is a special token (such as [CLS] for classification tasks). This is why tokens like [CLS] are a thing.
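For context, in fine-tuning for classification the pooled [CLS] vector is typically fed into a small task-specific head. A hypothetical sketch (the batch size, hidden size, and two-label setup are made up):

```python
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 2
pooled_output = torch.randn(4, hidden_size)  # stand-in for the pooler's output ([CLS] vector)

# Task-specific classification head trained during fine-tuning.
classifier = nn.Linear(hidden_size, num_labels)
logits = classifier(pooled_output)           # [4, num_labels], one score per label
```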
This question is just about the term "pooler", and it may be more of an English question than a question about BERT.
By reading this repository and its issues, I found that the "pooler layer" is placed after the Transformer encoder stack and that it changes depending on the training task, but I can't understand why it is called "pooler".
I googled the words "pooler" and "pooler layer", and it seems that this is not ML terminology.
BTW, the pooling layer that appears in CNNs and similar models is a similar word, but it seems to be a different thing.