Open eu9ene opened 3 months ago
We noticed that it's only around 30%. This is likely because the student model is smaller than the teacher. We can try to improve it by increasing the batch size.
In comparison, for teacher training: