In #771 I tested the effects of reducing the distillation data to understand that expensive part of our pipeline. However, that experiment used a tiny student model, so we should repeat it for the base student model to see if there is a difference. Also, I want to test it on a morphologically more complex language like Lithuanian.