Assumption: Fine-tuning BERT for dependency parsing changes the top layers in two ways: it makes the information from the lower layers available to the parser, and it outsources some of the processing that normally happens in the parser's Bi-LSTM layers to the top layers of BERT.
Idea for testing this:
Re-initialise the parameters of the top 3 or so layers randomly,
freeze the BERT layers that have not been re-initialised,
train the top layers on the fine-tuning task,
unfreeze BERT and
fine-tune all layers as usual.
If this performs just as well as the standard procedure, that would suggest that the information the top layers acquired during pre-training does not contribute anything to the final task.
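A minimal sketch of this schedule, assuming a PyTorch / Hugging Face Transformers setup; the checkpoint name, NUM_REINIT, the use of the private _init_weights helper for re-initialisation, and the train_parser placeholder are illustrative assumptions, not part of the proposal above.

```python
from transformers import BertModel

NUM_REINIT = 3  # "top 3 or so layers"

model = BertModel.from_pretrained("bert-base-cased")

# Step 1: re-initialise the parameters of the top NUM_REINIT encoder layers.
# _init_weights is the (private) initialiser transformers itself applies at load time.
for layer in model.encoder.layer[-NUM_REINIT:]:
    layer.apply(model._init_weights)

# Step 2: freeze everything that has not been re-initialised
# (embeddings plus the lower encoder layers).
for param in model.embeddings.parameters():
    param.requires_grad = False
for layer in model.encoder.layer[:-NUM_REINIT]:
    for param in layer.parameters():
        param.requires_grad = False

# Step 3: train the top layers (together with the parser) on the fine-tuning
# task; train_parser is a placeholder for the usual parser training loop.
# train_parser(model)

# Steps 4-5: unfreeze BERT and fine-tune all layers as usual.
for param in model.parameters():
    param.requires_grad = True
# train_parser(model)
```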
See also issue #93.