Part of fine-tuning is adjusting the network so that relevant information from lower layers reaches the top layer, where it is consumed by the task-specific module. With the limited amount of training data available in the fine-tuning task, these changes may be difficult to achieve, so it may be beneficial to give the task-specific component direct access to lower layers of BERT. For dependency parsing, layers 8 and 9 (out of 12) seem to work well when BERT is kept frozen (UDPipeFuture). Try these layers with our parser, i.e., prune the layers above them before fine-tuning.
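A minimal sketch of this layer pruning, assuming a HuggingFace transformers BERT model; the checkpoint name and the way the two layer outputs are combined here are illustrative, not necessarily the parser's actual setup:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; substitute whatever BERT variant the parser uses.
bert = AutoModel.from_pretrained("bert-base-cased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Prune the top three transformer blocks so the encoder now ends at layer 9.
# The parser's task-specific head then attaches directly to this truncated stack,
# and fine-tuning only updates the layers that remain.
bert.encoder.layer = bert.encoder.layer[:9]
bert.config.num_hidden_layers = 9

enc = tokenizer("The dog chased the cat .", return_tensors="pt")
out = bert(**enc, output_hidden_states=True)

# hidden_states[0] is the embedding output; hidden_states[i] is the output of layer i.
# After pruning, last_hidden_state equals hidden_states[9].
layer8, layer9 = out.hidden_states[8], out.hidden_states[9]

# One simple way to expose both layers to the parser head: average them.
word_repr = (layer8 + layer9) / 2
```

Slicing the `ModuleList` keeps the pretrained weights of layers 1-9 intact while discarding the parameters of layers 10-12, so the fine-tuned model is also smaller and cheaper to train than the full stack.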