iejMac opened 2 years ago
One big problem with continuing from distilled models of different sizes is that the bare models usually have different output embedding dimensionalities. For example, CLIP L/14 outputs 768-d embeddings and H/14 outputs 1024-d. What we currently do is take the student (H/14) and add an MLP on top to bring its outputs down to the teacher's (L/14) embedding space. But this is problematic: when you want to continue with contrastive fine-tuning you need to take that MLP off, which exposes the true outputs of the student model, and those are not contrastively useful. This means that for the first few (potentially a few hundred/thousand) steps the model needs to learn this property, which might knock it out of its good distilled initialization since those weights are not frozen.
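A minimal PyTorch sketch of the current setup, just to make the problem concrete (the dimensions match the L/14 → H/14 example above, but the module names and MLP shape are illustrative, not the actual open_clip code):

```python
import torch
import torch.nn as nn

# Dimensions from the example above: student (H/14) outputs 1024-d,
# teacher (L/14) outputs 768-d embeddings.
STUDENT_DIM, TEACHER_DIM = 1024, 768

class StudentWithHead(nn.Module):
    """Student encoder plus an MLP that maps its embeddings down to the
    teacher's embedding space for distillation. This head has to be
    removed before contrastive fine-tuning, which exposes the raw
    1024-d student outputs -- the problem described above."""
    def __init__(self, student: nn.Module):
        super().__init__()
        self.student = student
        self.head = nn.Sequential(
            nn.Linear(STUDENT_DIM, STUDENT_DIM),
            nn.GELU(),
            nn.Linear(STUDENT_DIM, TEACHER_DIM),
        )

    def forward(self, x):
        return self.head(self.student(x))

# Stand-in for the real H/14 tower, just to check shapes.
student = nn.Linear(32, STUDENT_DIM)
model = StudentWithHead(student)
out = model(torch.randn(4, 32))
print(out.shape)  # torch.Size([4, 768])
```

During distillation the loss is computed between `out` and the teacher's 768-d embeddings; dropping `self.head` afterwards is what leaves the student's true outputs untrained for the contrastive objective.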
A potential alternative would be to freeze the teacher model and add the MLP on top of it, then train the teacher contrastively just so the MLP finds some trivial solution that makes the teacher's outputs larger-dimensional and contrastively useful. This should be much simpler since the only trainable weights are inside the MLP. That way, when we start distilling the teacher into the student, we no longer need to take anything off after training since the student's output embedding is already contrastively useful. This means contrastive fine-tuning should continue more smoothly.
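The alternative could be sketched like this (again a hypothetical sketch, not the actual implementation): freeze the teacher, bolt a trainable up-projection MLP on top so the teacher already emits 1024-d, contrastively useful targets, and only those MLP weights receive gradients during the short contrastive warm-up:

```python
import torch
import torch.nn as nn

TEACHER_DIM, STUDENT_DIM = 768, 1024  # per the L/14 -> H/14 example

class TeacherWithUpProjection(nn.Module):
    """Frozen teacher plus a trainable MLP that lifts its embeddings to
    the student's dimensionality. Only the MLP is trainable, so a brief
    contrastive run just teaches the head a trivial up-projection."""
    def __init__(self, teacher: nn.Module):
        super().__init__()
        self.teacher = teacher
        for p in self.teacher.parameters():
            p.requires_grad = False  # teacher stays frozen
        self.head = nn.Sequential(
            nn.Linear(TEACHER_DIM, STUDENT_DIM),
            nn.GELU(),
            nn.Linear(STUDENT_DIM, STUDENT_DIM),
        )

    def forward(self, x):
        with torch.no_grad():
            z = self.teacher(x)
        return self.head(z)

teacher = nn.Linear(32, TEACHER_DIM)  # stand-in for the real L/14 tower
model = TeacherWithUpProjection(teacher)
out = model(torch.randn(4, 32))
print(out.shape)  # torch.Size([4, 1024])

# Only the head's parameters are trainable.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(all(n.startswith("head") for n in trainable))  # True
```

After this warm-up, distillation targets already live in the student's 1024-d space, so nothing needs to be removed from the student before contrastive fine-tuning.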