numenta / nupic.research


transformers: knowledge distillation mixin and initial experiments #483

Closed lucasosouza closed 3 years ago

lucasosouza commented 3 years ago

Initial distillation mixin for transformers. Results with tiny BERT are encouraging (`<--` in the table below means "distilled from"):

| model | num params | train loss | eval loss | perplexity |
|---|---|---|---|---|
| mini_bert_100k | 10,615,808 | 3.669 | 4.773 | 27.342 |
| mini_bert_50k | 10,615,808 | 4.464 | 5.815 | 56.309 |
| tiny_bert_50k <-- bert_1mi | 4,124,928 | - | 5.351 | 40.817 |
| tiny_bert_100k | 4,124,928 | 4.566 | 5.802 | 55.793 |
| tiny_bert_50k | 4,124,928 | 5.990 | 8.367 | 330.234 |

The learning rate in the distillation experiment is two orders of magnitude higher, and we can probably increase it even more.
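
For context, the core of a soft-target distillation objective in this kind of setup looks roughly like the sketch below. This is a minimal illustration, not the mixin in this PR; `temperature` and `kd_factor` are hypothetical hyperparameter names used only for the example.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, kd_factor=0.5):
    """Blend the usual hard-label loss with a KL term on softened teacher logits."""
    # Soft targets: KL divergence between temperature-scaled distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Hard targets: standard cross-entropy against the true labels.
    hard_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
    )

    # Weighted combination of the two terms.
    return kd_factor * soft_loss + (1.0 - kd_factor) * hard_loss
```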

A few things are still missing; the main ones are: