Pros:
Cons:
Perhaps this is something in between contrastive training from scratch and the standard distillation we're doing. This is like teacher-guided contrastive learning. But in that case, this makes more sense to me:
total_loss = contrastive_loss(student_embeddings) + MSE(student_embeddings, teacher_embeddings)
That way, if the H (the student) can find better solutions than the L (the teacher), the contrastive_loss(H) term will push it toward its own independent solutions rather than the teacher's solutions coming from the MSE term.
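A minimal PyTorch sketch of that combined objective (the function names and the fixed temperature are assumptions, not something specified in this thread); the teacher embeddings would come from a frozen original CLIP:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Standard CLIP InfoNCE: the matching image/text pairs on the
    # diagonal are the positives (identity matrix as labels).
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

def total_loss(student_img, student_txt, teacher_img, teacher_txt):
    contrastive = clip_contrastive_loss(student_img, student_txt)
    # MSE alignment pulls the student embeddings toward the teacher's.
    align = (F.mse_loss(student_img, teacher_img) +
             F.mse_loss(student_txt, teacher_txt))
    return contrastive + align
```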
@rom1504
Also, in the proposed method above you'd probably want to decay the influence of the teacher over time, so it would actually be:
total_loss = (1 - gamma) * contrastive_loss(student_embeddings) + gamma * MSE(student_embeddings, teacher_embeddings)
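For example, gamma could simply decay linearly from 1 to 0 over training (a hypothetical schedule, not one proposed here):

```python
def gamma_schedule(step, total_steps):
    # Teacher weight decays linearly from 1.0 to 0.0 over training.
    return max(0.0, 1.0 - step / total_steps)

def weighted_total_loss(step, total_steps, contrastive, mse_align):
    gamma = gamma_schedule(step, total_steps)
    return (1 - gamma) * contrastive + gamma * mse_align
```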
Potential idea: instead of MSE between similarities, just use the contrastive loss, but with the matrix of teacher similarity scores as the label instead of the identity matrix (see the sketch below).
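One way this could look in PyTorch, assuming the teacher's softmaxed similarity matrix is used as a soft target for cross-entropy and both models share a fixed temperature (both are assumptions):

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(student_img, student_txt,
                          teacher_img, teacher_txt, temperature=0.07):
    # Student logits: scaled cosine similarities, as in standard CLIP.
    s_img = F.normalize(student_img, dim=-1)
    s_txt = F.normalize(student_txt, dim=-1)
    student_logits = s_img @ s_txt.t() / temperature

    # Teacher similarity matrix replaces the identity matrix as the target.
    with torch.no_grad():
        t_img = F.normalize(teacher_img, dim=-1)
        t_txt = F.normalize(teacher_txt, dim=-1)
        teacher_targets = F.softmax(t_img @ t_txt.t() / temperature, dim=-1)

    # Cross-entropy against soft targets, in both directions.
    loss_i2t = F.cross_entropy(student_logits, teacher_targets)
    loss_t2i = F.cross_entropy(student_logits.t(), teacher_targets.t())
    return (loss_i2t + loss_t2i) / 2
```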
MSE(sim(new_clip_image, new_clip_text), sim(original_clip_image, original_clip_text))
This could be used completely instead of the alignment loss, or in addition to it.
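A sketch of that similarity-matching term, assuming "new clip" is the student being trained and "original clip" is the frozen teacher (the identifiers are hypothetical):

```python
import torch
import torch.nn.functional as F

def similarity_mse_loss(student_img, student_txt,
                        teacher_img, teacher_txt):
    # Match the student's image-text similarity matrix to the teacher's,
    # instead of (or in addition to) matching embeddings directly.
    student_sim = (F.normalize(student_img, dim=-1) @
                   F.normalize(student_txt, dim=-1).t())
    with torch.no_grad():
        teacher_sim = (F.normalize(teacher_img, dim=-1) @
                       F.normalize(teacher_txt, dim=-1).t())
    return F.mse_loss(student_sim, teacher_sim)
```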