We'll need brief, easy-to-follow explanations of most sections. Assume the reader does not know what knowledge distillation (KD) is.
Explain the purpose of distinguishing a temp student model from a final student model. It looks to me as though the temp student serves as a baseline, and the final student uses KD. Describe this using math in markdown.
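As a starting point for that math, here is one possible way to write the two objectives. This is only a sketch: it assumes the task is regression with a mean-squared-error loss, and the symbols $f_S$ (student), $f_T$ (trained, frozen teacher), $(x_i, y_i)$ (training pairs), $N$ (number of pairs) and $\alpha \in [0, 1]$ (mixing weight) are my notation, not taken from the draft. The draft's actual losses may differ.

```latex
% Temp student (baseline): fit the ground-truth targets only.
$$
\mathcal{L}_{\text{base}} = \frac{1}{N} \sum_{i=1}^{N} \bigl( f_S(x_i) - y_i \bigr)^2
$$

% Final student (KD): also match the frozen teacher's predictions.
$$
\mathcal{L}_{\text{KD}} = (1 - \alpha) \, \frac{1}{N} \sum_{i=1}^{N} \bigl( f_S(x_i) - y_i \bigr)^2
  + \alpha \, \frac{1}{N} \sum_{i=1}^{N} \bigl( f_S(x_i) - f_T(x_i) \bigr)^2
$$
```

With $\alpha = 0$ the KD objective reduces to the baseline, which makes it clear that the only difference between the two students is the extra teacher-matching term.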
It is unclear what the difference between "Teacher Sinusoid" and "Teacher Model Prediction" is. The same applies to the corresponding student cases.
It is unclear what the problem is. State it in the form "Given X, do Y."
Performance issues:
It seems like the teacher performs worse than the student, which is causing the student + KD to fail. An alternative idea is to compress all your models to 2-3 layers (same depth), but give the student few neurons in the hidden layer (256?) and the teacher far more (1024?). I'm not too sure what these numbers should be, but the large depth might be causing the teacher to underperform.
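To get a feel for how different the two models would be under that suggestion, here is a rough parameter count. It is only a sketch: the helper `mlp_param_count` is hypothetical, and the layer layout (scalar input, two hidden layers, scalar output) is my assumption, since I don't know the draft's exact shapes. The widths 256 and 1024 are the tentative numbers from above.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    return sum(
        n_in * n_out + n_out  # weight matrix plus bias vector per layer
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Assumed layout: 1 input, two hidden layers of equal width, 1 output.
student_sizes = [1, 256, 256, 1]
teacher_sizes = [1, 1024, 1024, 1]

student_params = mlp_param_count(student_sizes)
teacher_params = mlp_param_count(teacher_sizes)

print(f"student: {student_params} parameters")
print(f"teacher: {teacher_params} parameters")
print(f"teacher/student ratio: {teacher_params / student_params:.1f}")
```

Under these assumptions the teacher ends up with roughly 16 times as many parameters as the student at the same depth, so capacity differs through width alone.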
If KD doesn't work in the end, pivot this tutorial to prediction only. Define the problem clearly.
KDTT. First Draft.