Assumption: Fine-tuning BERT for dependency parsing changes the top layers in two ways: it makes the information from the lower layers available to the parser, and it outsources some of the processing that normally happens in the parser's Bi-LSTM layers to the top layers of BERT.
Idea for testing this:
Re-initialise the parameters of the top 3 or so layers randomly,
freeze the BERT layers that have not been re-initialised,
train the top layers on the fine-tuning task,
unfreeze BERT and
fine-tune all layers as usual.
If this performs just as well as the standard procedure, that would suggest that the information the top layers acquired during pre-training does not contribute anything to the final task.
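A minimal sketch of this schedule, assuming a PyTorch / Hugging Face Transformers setup; the checkpoint name, NUM_REINIT, the use of the private _init_weights helper for re-initialisation, and the train_parser placeholder are illustrative assumptions, not part of the proposal above.

```python
from transformers import BertModel

NUM_REINIT = 3  # "top 3 or so layers"

model = BertModel.from_pretrained("bert-base-cased")

# Step 1: re-initialise the parameters of the top NUM_REINIT encoder layers.
# _init_weights is the (private) initialiser transformers itself applies at load time.
for layer in model.encoder.layer[-NUM_REINIT:]:
    layer.apply(model._init_weights)

# Step 2: freeze everything that has not been re-initialised
# (embeddings plus the lower encoder layers).
for param in model.embeddings.parameters():
    param.requires_grad = False
for layer in model.encoder.layer[:-NUM_REINIT]:
    for param in layer.parameters():
        param.requires_grad = False

# Step 3: train the top layers (together with the parser) on the fine-tuning
# task; train_parser is a placeholder for the usual parser training loop.
# train_parser(model)

# Steps 4-5: unfreeze BERT and fine-tune all layers as usual.
for param in model.parameters():
    param.requires_grad = True
# train_parser(model)
```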
See also issue #93.