numenta / nupic.research


transformers: knowledge distillation mixin and initial experiments #483

Closed lucasosouza closed 3 years ago

lucasosouza commented 3 years ago

Initial distillation mixin for transformers. Results with tiny BERT are encouraging (`<--` in the table below means "distilled from"):

| model | num params | train loss | eval loss | perplexity |
|---|---|---|---|---|
| mini_bert_100k | 10,615,808 | 3.669 | 4.773 | 27.342 |
| mini_bert_50k | 10,615,808 | 4.464 | 5.815 | 56.309 |
| tiny_bert_50k <-- bert_1mi | 4,124,928 | - | 5.351 | 40.817 |
| tiny_bert_100k | 4,124,928 | 4.566 | 5.802 | 55.793 |
| tiny_bert_50k | 4,124,928 | 5.990 | 8.367 | 330.234 |

The learning rate in the distillation experiment is two orders of magnitude higher, and we can probably increase it even more.
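
For context, the core of a soft-target distillation objective in this kind of setup looks roughly like the sketch below. This is a minimal illustration, not the mixin in this PR; `temperature` and `kd_factor` are hypothetical hyperparameter names used only for the example.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, kd_factor=0.5):
    """Blend the usual hard-label loss with a KL term on softened teacher logits."""
    # Soft targets: KL divergence between temperature-scaled distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Hard targets: standard cross-entropy against the true labels.
    hard_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
    )

    # Weighted combination of the two terms.
    return kd_factor * soft_loss + (1.0 - kd_factor) * hard_loss
```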

A few things are still missing; the main ones are: