Hi @iamxinxin, yes you can just fine-tune BERT for classification.
What are `/pooler/` and `/cls/`? All BERT implementations (TensorFlow Hub, HuggingFace) add an additional randomly initialized dense (projection) layer with tanh activation, dubbed `pooler`, in between the `[CLS]` final representation and the classification layer (a final dense layer mapping the 768 real values to N classes). In the case of the TensorFlow Hub implementation, `/cls/` refers to the classification layer used for the NSP pre-training task.
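As an illustration (not this repo's code), the extra layer is easy to see in the HuggingFace implementation, where `BertModel` exposes a `pooler` module (a 768-to-768 dense layer followed by tanh) that sits on top of the `[CLS]` hidden state:

```python
# Minimal sketch, assuming the HuggingFace `transformers` package is installed.
# It only prints the pooler module that sits between the [CLS] hidden state
# and any downstream classification head.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
print(model.pooler)
# BertPooler(
#   (dense): Linear(in_features=768, out_features=768, bias=True)
#   (activation): Tanh()
# )
```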
We found that this additional layer, i.e., the `pooler`, leads to slower convergence and worse classification results. So, we removed this extra layer and follow the article of Devlin et al. (2019) (https://arxiv.org/abs/1810.04805) by the book.
I quote: "The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. [...] the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis."
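For reference, here is a minimal Keras-style sketch of that by-the-book setup: the final hidden state of `[CLS]` is fed directly into a single dense output layer, with no intermediate pooler. This is not the code from this repo; the TF Hub handle, `N_CLASSES`, and `MAX_LEN` are placeholders.

```python
# Minimal sketch, assuming TensorFlow 2 and a TF Hub BERT encoder: classify
# from the [CLS] final hidden state only, skipping the tanh pooler,
# as described in Devlin et al. (2019).
import tensorflow as tf
import tensorflow_hub as hub

N_CLASSES = 5   # placeholder
MAX_LEN = 128   # placeholder
BERT_URL = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"  # assumed handle

input_word_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_mask")
input_type_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_type_ids")

encoder = hub.KerasLayer(BERT_URL, trainable=True)
outputs = encoder({"input_word_ids": input_word_ids,
                   "input_mask": input_mask,
                   "input_type_ids": input_type_ids})

# Take the final hidden state of the [CLS] token (position 0) instead of
# outputs["pooled_output"], which would pass through the extra tanh pooler.
cls_token = outputs["sequence_output"][:, 0, :]
logits = tf.keras.layers.Dense(N_CLASSES, activation="softmax")(cls_token)

model = tf.keras.Model([input_word_ids, input_mask, input_type_ids], logits)
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```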
Thanks for sharing!! I'm confused about removing the layers related to `cls` and `pooler` in `neural_networks/layers/bert.py`, `build()`:

```python
# Remove unused layers and set trainable parameters
self.trainable_weights += [var for var in self.bert.variables
                           if not "/cls/" in var.name and not "/pooler/" in var.name]
```

Can I just fine-tune the original BERT for classification?