From the two links below, I would add to the statement above: during BERT fine-tuning, none of the layers are frozen. This means that (with BERT base) all 12 pre-trained layers are trained again, together with the new task-specific parameters.
This could be the difference we have been looking for. For CV models, it may be more common to keep at least a couple of layers frozen (see the sketch below).
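To make the contrast concrete, here is a minimal sketch of the two regimes, assuming the Hugging Face `transformers` library with PyTorch; the model name, label count, and the number of frozen layers are illustrative choices, not settings from our project:

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Default BERT fine-tuning: nothing is frozen, so all 12 encoder layers
# are updated together with the new classification head.
assert all(p.requires_grad for p in model.parameters())

# CV-style alternative: freeze the embeddings and the lower encoder layers,
# training only the upper layers and the task-specific head.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:8]:  # freeze the first 8 of 12 layers
    for param in layer.parameters():
        param.requires_grad = False
```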
https://yashuseth.blog/2019/06/12/bert-explained-faqs-understand-bert-working/ https://www.quora.com/What-is-the-difference-between-transfer-learning-and-fine-tuning
I think this Medium post covers everything: https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a
Some simple answers here as well: https://stats.stackexchange.com/questions/343763/fine-tuning-vs-transferlearning-vs-learning-from-scratch
In general, transfer learning refers to the process of pre-training a model on a source task, which can be kept general (in the case of NLP, to get a good sense of a language) or be very specific. Transfer learning comes in when this pre-trained model is used further down the line for a different task and/or a different domain. The pre-trained model can be fine-tuned on a new dataset and a new training task, which basically means that the weights resulting from pre-training are used as a starting ground for the fine-tuned model. This is done by attaching a head on top of the pre-trained model that fits the target task (a sketch of this pattern follows below).
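As a rough illustration of "pre-trained weights as a starting ground, plus a new head for the target task": a sketch with PyTorch and a recent version of Hugging Face `transformers`; the class name, model name, and pooling choice are placeholders, not something from our code base.

```python
import torch.nn as nn
from transformers import BertModel


class BertWithTaskHead(nn.Module):
    def __init__(self, num_labels: int):
        super().__init__()
        # Body: weights come from pre-training on the source task/domain.
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        # Head: freshly initialised, specific to the target task.
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Feed the pooled [CLS] representation into the task head.
        return self.head(outputs.pooler_output)
```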
This process can be extremely helpful if data is scarce in the target domain or if the target task requires a complex understanding of the underlying concepts. Models and data from domains where both are more abundant can be used to build better models for the specific problem.
There are lots of different types of transfer learning in NLP. The type we focus on is sequential transfer learning, since we change both the domain and the task. For more information: https://ruder.io/thesis/neural_transfer_learning_for_nlp.pdf
Presumably, there is a difference in how transfer learning is viewed in computer vision and in natural language processing. In CV, the description above should match. For NLP, however, it is debatable whether the weights from the pre-trained model are merely carried over unchanged. Fine-tuning simply takes too long for the classification head to be the only thing being computed. Our hypothesis is that BERT itself (as the pre-trained model) is re-trained to some extent. It is clear, however, that BERT is not fully retrained from scratch, which would take far too much time. A rough check is sketched below.
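One quick way to sanity-check this (an illustrative snippet, not part of our pipeline) is to compare the number of trainable parameters in the full model with the size of the classification head alone; if only the head were trained, there would be almost nothing to compute:

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Count parameters that would receive gradient updates.
total = sum(p.numel() for p in model.parameters() if p.requires_grad)
head = sum(p.numel() for p in model.classifier.parameters())
print(f"trainable parameters, full model: {total:,}")   # roughly 110 million
print(f"trainable parameters, head only:  {head:,}")    # roughly 1.5 thousand
```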
@tarrade: Can you review this and enhance it with your thoughts?