Closed: lucasosouza closed this pull request 3 years ago.
Initial distillation mixin for transformers. Results with tiny BERT are encouraging (`<--` means "distilled from"):
The learning rate in the distillation experiment is two orders of magnitude higher, and we can probably increase it even further.
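For context, here is a minimal sketch of the standard soft-target distillation loss (Hinton et al., 2015) that a mixin of this kind typically adds on top of a Hugging Face-style model. All names below (`DistillationMixin`, `setup_distillation`, `compute_distillation_loss`, `kd_temperature`) are illustrative assumptions, not necessarily the API used in this PR:

```python
import torch
import torch.nn.functional as F


class DistillationMixin:
    """Sketch of a distillation mixin: trains a small student against
    the soft predictions of a frozen teacher instead of hard labels."""

    def setup_distillation(self, teacher_model, kd_temperature=2.0):
        # Freeze the teacher; only the student is updated.
        self.teacher_model = teacher_model.eval()
        self.kd_temperature = kd_temperature
        for p in self.teacher_model.parameters():
            p.requires_grad = False

    def compute_distillation_loss(self, student_logits, inputs):
        with torch.no_grad():
            teacher_logits = self.teacher_model(**inputs).logits
        T = self.kd_temperature
        # KL divergence between temperature-scaled distributions,
        # scaled by T^2 so gradient magnitudes stay comparable
        # across temperatures (Hinton et al., 2015).
        return F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T ** 2)
```

Softened teacher targets yield smoother gradients than one-hot labels, which is consistent with the much higher learning rate tolerated in the distillation experiment above.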
A few things are still missing, the main ones being: