iejMac / encoder-distill

Align embedding spaces of PyTorch encoders with common input types.
MIT License

Add MLP to teacher instead of student #15

Open iejMac opened 2 years ago

iejMac commented 2 years ago
  1. Freeze the teacher
  2. Add an MLP on top
  3. Train the teacher again with this new MLP (for CLIP, just do contrastive learning for a bit)
  4. Distill this new teacher + MLP into the student (rough sketch of steps 1–2 below)
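
A minimal sketch of steps 1–2, assuming a CLIP-like teacher tower with a 768-d output and the student's space being 1024-d; the wrapper name `TeacherWithHead` and the MLP shape are illustrative, not from this repo:

```python
import torch
import torch.nn as nn

class TeacherWithHead(nn.Module):
    """Hypothetical wrapper: frozen teacher encoder + trainable MLP head."""
    def __init__(self, teacher: nn.Module, in_dim: int = 768, out_dim: int = 1024):
        super().__init__()
        self.teacher = teacher
        for p in self.teacher.parameters():  # step 1: freeze the teacher
            p.requires_grad = False
        self.head = nn.Sequential(           # step 2: MLP on top of the teacher
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x):
        with torch.no_grad():
            z = self.teacher(x)
        return self.head(z)
```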
iejMac commented 2 years ago

One big problem with continuing from distilled models of different sizes is that the bare models usually have different output embedding dimensionalities. For example, CLIP L/14 outputs 768-d embeddings and H/14 outputs 1024-d. What we currently do is take the student (H/14) and add an MLP on top to bring it down to the teacher's (L/14) embedding space. But this is problematic: when you want to continue with contrastive fine-tuning you need to take that MLP off, which exposes the true outputs of the student model, and those are not contrastively useful. This means that for the first few (potentially few hundred/thousand) steps the model needs to learn this property, which might knock it out of its good distilled initialization since those weights are not frozen.
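
For contrast, a rough sketch of the current setup described above, assuming the H/14 student outputs 1024-d and gets projected down to the teacher's 768-d space for distillation; all names here are illustrative:

```python
import torch.nn as nn

STUDENT_DIM, TEACHER_DIM = 1024, 768  # H/14 student -> L/14 teacher

class StudentWithProjection(nn.Module):
    """Current approach: student + MLP projecting into the teacher's space."""
    def __init__(self, student: nn.Module):
        super().__init__()
        self.student = student
        self.proj = nn.Sequential(
            nn.Linear(STUDENT_DIM, STUDENT_DIM),
            nn.GELU(),
            nn.Linear(STUDENT_DIM, TEACHER_DIM),
        )

    def forward(self, x):
        return self.proj(self.student(x))

# After distillation the projection has to be discarded, exposing raw 1024-d
# student outputs that were never trained to be contrastively useful:
# contrastive_model = wrapped.student
```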

A potential alternative would be to freeze the teacher model and add the MLP on top of it, then train the teacher contrastively just so the MLP finds some trivial solution that makes the teacher output larger-dimensional embeddings which are contrastively useful. This should be much simpler since the only trainable weights are inside the MLP. That way, when we start distilling the teacher into the student, we no longer need to take anything off after training since the student's output embedding is already contrastively useful. This means contrastive fine-tuning should continue more smoothly.
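
A sketch of what step 3 might look like: only the MLP head's parameters are optimized, using a standard CLIP-style symmetric contrastive (InfoNCE) loss so the frozen teacher's 768-d outputs get lifted to a contrastively useful 1024-d space before distillation. The `teacher_img` / `teacher_txt` wrappers and the training loop are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over matched image/text pairs in a batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Only the MLP heads are trainable; the teacher towers stay frozen (step 3).
# teacher_img / teacher_txt are hypothetical TeacherWithHead wrappers.
# optimizer = torch.optim.AdamW(
#     list(teacher_img.head.parameters()) + list(teacher_txt.head.parameters()),
#     lr=1e-4,
# )
# for images, texts in loader:
#     loss = clip_contrastive_loss(teacher_img(images), teacher_txt(texts))
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```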