iliaschalkidis / lmtc-eurlex57k

Large-Scale Multi-Label Text Classification on EU Legislation

Why remove the layers related to 'cls' and 'pooler'? #13

Closed: iamxinxin closed this issue 3 years ago

iamxinxin commented 3 years ago

Thanks for sharing! I'm confused about the removal of the layers related to 'cls' and 'pooler' in `neural_networks/layers/bert.py` / `build()`:

```python
# Remove unused layers and set trainable parameters
self.trainable_weights += [var for var in self.bert.variables if not "/cls/" in var.name and not "/pooler/" in var.name]
```

Can I just fine-tune the original BERT for classification?

iliaschalkidis commented 3 years ago

Hi @iamxinxin, yes, you can just fine-tune BERT for classification.
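For example, something along these lines would do the job (a rough sketch using the Hugging Face `transformers` library, not this repo's code; the model name, label count, and learning rate are placeholders):

```python
# Rough sketch (Hugging Face `transformers`, not this repo's code):
# fine-tuning BERT for multi-label classification.
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

NUM_LABELS = 100  # placeholder: size of your label set

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                        num_labels=NUM_LABELS)

# Multi-label setup: independent label probabilities, so binary cross-entropy over the logits
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))

# model.fit(encoded_inputs, labels, batch_size=8, epochs=3)
```

Note that the off-the-shelf classification head there sits on top of the pooled output, which is exactly the extra layer discussed below.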

Why do we remove /pooler/ and /cls/?

All BERT implementations (TensorflowHub, HuggingFace) add an additional randomly initialized dense (projection) layer with tanh activation, dubbed the pooler, between the final [CLS] representation and the classification layer (a final dense layer mapping the 768 real values to N classes). In the case of the TensorflowHub implementation, /cls/ refers to the classification head used for the NSP pre-training task.
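Conceptually, with the pooler the classification head looks like this (a plain Keras sketch, not the actual TensorflowHub wiring; hidden size and label count are placeholders):

```python
import tensorflow as tf

HIDDEN_SIZE = 768   # BERT-base hidden size
NUM_LABELS = 100    # placeholder: size of your label set

# Final hidden state of the [CLS] token coming out of the BERT encoder
cls_hidden = tf.keras.Input(shape=(HIDDEN_SIZE,), name="cls_final_hidden_state")

# The "pooler": an extra dense + tanh projection applied to the [CLS] state
pooled = tf.keras.layers.Dense(HIDDEN_SIZE, activation="tanh",
                               name="pooler")(cls_hidden)

# Classification layer: maps the 768 values to N label probabilities
probs = tf.keras.layers.Dense(NUM_LABELS, activation="sigmoid",
                              name="classifier")(pooled)
```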

We found that this additional layer, i.e., the pooler, leads to slower convergence and worse classification results. So, we remove this extra layer and follow the article of Devlin et al. (2019) (https://arxiv.org/abs/1810.04805) by the book.

I quote: "The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. [...] the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis."
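So, without the pooler, the head reduces to the [CLS] final hidden state feeding the output layer directly. A minimal Keras sketch of that setup (again with placeholders, not the exact code of this repo, which wraps the TensorflowHub module in a custom layer):

```python
import tensorflow as tf

HIDDEN_SIZE = 768   # BERT-base hidden size
NUM_LABELS = 100    # placeholder: size of your label set

# Final hidden state of the [CLS] token coming out of the BERT encoder
cls_hidden = tf.keras.Input(shape=(HIDDEN_SIZE,), name="cls_final_hidden_state")

# No pooler: the [CLS] representation goes straight into the output layer,
# as described in Devlin et al. (2019)
probs = tf.keras.layers.Dense(NUM_LABELS, activation="sigmoid",
                              name="classifier")(cls_hidden)

model = tf.keras.Model(inputs=cls_hidden, outputs=probs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss="binary_crossentropy")
```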