iliaschalkidis / lmtc-eurlex57k

Large-Scale Multi-Label Text Classification on EU Legislation

Why remove the layers related to 'cls' and 'pooler'? #13

Closed: iamxinxin closed this issue 3 years ago

iamxinxin commented 3 years ago

Thanks for sharing! I'm confused about the removal of the layers related to 'cls' and 'pooler' in `neural_networks/layers/bert.py` / `build()`:

```python
# Remove unused layers and set trainable parameters
self.trainable_weights += [var for var in self.bert.variables if not "/cls/" in var.name and not "/pooler/" in var.name]
```

Can I just fine-tune the original BERT for classification?

iliaschalkidis commented 3 years ago

Hi @iamxinxin, yes, you can just fine-tune BERT for classification.
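For example, something along these lines would do the job (a rough sketch using the Hugging Face `transformers` library, not this repo's code; the model name, label count, and learning rate are placeholders):

```python
# Rough sketch (Hugging Face `transformers`, not this repo's code):
# fine-tuning BERT for multi-label classification.
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

NUM_LABELS = 100  # placeholder: size of your label set

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                        num_labels=NUM_LABELS)

# Multi-label setup: independent label probabilities, so binary cross-entropy over the logits
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))

# model.fit(encoded_inputs, labels, batch_size=8, epochs=3)
```

Note that the off-the-shelf classification head there sits on top of the pooled output, which is exactly the extra layer discussed below.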

Why do we remove /pooler/ and /cls/?

All BERT implementations (TensorflowHub, HuggingFace) add an additional randomly initialized dense (projection) layer with tanh activation, dubbed the pooler, between the final [CLS] representation and the classification layer (a final dense layer mapping the 768 real values to N classes). In the case of the TensorflowHub implementation, /cls/ refers to the classification head used for the NSP pre-training task.
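Conceptually, with the pooler the classification head looks like this (a plain Keras sketch, not the actual TensorflowHub wiring; hidden size and label count are placeholders):

```python
import tensorflow as tf

HIDDEN_SIZE = 768   # BERT-base hidden size
NUM_LABELS = 100    # placeholder: size of your label set

# Final hidden state of the [CLS] token coming out of the BERT encoder
cls_hidden = tf.keras.Input(shape=(HIDDEN_SIZE,), name="cls_final_hidden_state")

# The "pooler": an extra dense + tanh projection applied to the [CLS] state
pooled = tf.keras.layers.Dense(HIDDEN_SIZE, activation="tanh",
                               name="pooler")(cls_hidden)

# Classification layer: maps the 768 values to N label probabilities
probs = tf.keras.layers.Dense(NUM_LABELS, activation="sigmoid",
                              name="classifier")(pooled)
```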

We found that this additional layer, i.e., the pooler, leads to slower convergence and worse classification results. So, we remove this extra layer and follow the article of Devlin et al. (2019) (https://arxiv.org/abs/1810.04805) by the book.

I quote: "The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. [...] the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis."
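So, without the pooler, the head reduces to the [CLS] final hidden state feeding the output layer directly. A minimal Keras sketch of that setup (again with placeholders, not the exact code of this repo, which wraps the TensorflowHub module in a custom layer):

```python
import tensorflow as tf

HIDDEN_SIZE = 768   # BERT-base hidden size
NUM_LABELS = 100    # placeholder: size of your label set

# Final hidden state of the [CLS] token coming out of the BERT encoder
cls_hidden = tf.keras.Input(shape=(HIDDEN_SIZE,), name="cls_final_hidden_state")

# No pooler: the [CLS] representation goes straight into the output layer,
# as described in Devlin et al. (2019)
probs = tf.keras.layers.Dense(NUM_LABELS, activation="sigmoid",
                              name="classifier")(cls_hidden)

model = tf.keras.Model(inputs=cls_hidden, outputs=probs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss="binary_crossentropy")
```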