tarrade / proj_multilingual_text_classification

Explore multilingual text classification using embeddings, BERT and deep learning architectures
Apache License 2.0

How to create your own distilled model? #50

Closed vluechinger closed 4 years ago

vluechinger commented 4 years ago

As BERT and language models in general are rather large, it is worth thinking about smaller versions, especially when it comes to deployment. The trade-off depends heavily on the downstream use case: larger models should generally give better results while using up more resources.

The basic idea behind distilled models is a teacher-student architecture: a smaller student model (for example with fewer layers or attention heads removed) is trained to reproduce the outputs of the large teacher model, so that the teacher's detailed knowledge is transferred to the compact student.
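For reference, a minimal sketch (not from this repo) of how the distillation loss in such a teacher-student setup could look in PyTorch; `temperature` and `alpha` are illustrative hyper-parameters:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Combine the soft-target loss (match the teacher) with the usual hard-label loss."""
    # Soft targets: the student learns to reproduce the teacher's softened
    # probability distribution (a higher temperature gives softer targets).
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard targets: standard cross-entropy against the gold labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```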

Open questions:

tarrade commented 4 years ago

Out of scope, but a good idea to reduce the size of the model.