Pros:
Cons:
Perhaps this is something in between contrastive training from scratch and the standard distillation we're doing. This is like teacher-guided contrastive learning. But in that case, this makes more sense to me:
total_loss = contrastive_loss(student_embeddings) + MSE(student_embeddings, teacher_embeddings)
That way, if the H (the student) can find better solutions than the L (the teacher), the contrastive_loss(H) term will push it toward its own independent solutions rather than the teacher's solutions coming from the MSE term.
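A minimal PyTorch sketch of that combined objective (the function names and the fixed temperature are assumptions, not something specified in this thread); the teacher embeddings would come from a frozen original CLIP:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Standard CLIP InfoNCE: the matching image/text pairs on the
    # diagonal are the positives (identity matrix as labels).
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

def total_loss(student_img, student_txt, teacher_img, teacher_txt):
    contrastive = clip_contrastive_loss(student_img, student_txt)
    # MSE alignment pulls the student embeddings toward the teacher's.
    align = (F.mse_loss(student_img, teacher_img) +
             F.mse_loss(student_txt, teacher_txt))
    return contrastive + align
```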
@rom1504
Also, in the proposed method above you'd probably want to decay the influence of the teacher over time, so it would actually be:
total_loss = (1 - gamma) * contrastive_loss(student_embeddings) + gamma * MSE(student_embeddings, teacher_embeddings)
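For example, gamma could simply decay linearly from 1 to 0 over training (a hypothetical schedule, not one proposed here):

```python
def gamma_schedule(step, total_steps):
    # Teacher weight decays linearly from 1.0 to 0.0 over training.
    return max(0.0, 1.0 - step / total_steps)

def weighted_total_loss(step, total_steps, contrastive, mse_align):
    gamma = gamma_schedule(step, total_steps)
    return (1 - gamma) * contrastive + gamma * mse_align
```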
Potential idea: instead of MSE between similarities, just use the contrastive loss, but with the matrix of teacher similarity scores as the label instead of the identity matrix (see the sketch below).
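One way this could look in PyTorch, assuming the teacher's softmaxed similarity matrix is used as a soft target for cross-entropy and both models share a fixed temperature (both are assumptions):

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(student_img, student_txt,
                          teacher_img, teacher_txt, temperature=0.07):
    # Student logits: scaled cosine similarities, as in standard CLIP.
    s_img = F.normalize(student_img, dim=-1)
    s_txt = F.normalize(student_txt, dim=-1)
    student_logits = s_img @ s_txt.t() / temperature

    # Teacher similarity matrix replaces the identity matrix as the target.
    with torch.no_grad():
        t_img = F.normalize(teacher_img, dim=-1)
        t_txt = F.normalize(teacher_txt, dim=-1)
        teacher_targets = F.softmax(t_img @ t_txt.t() / temperature, dim=-1)

    # Cross-entropy against soft targets, in both directions.
    loss_i2t = F.cross_entropy(student_logits, teacher_targets)
    loss_t2i = F.cross_entropy(student_logits.t(), teacher_targets.t())
    return (loss_i2t + loss_t2i) / 2
```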
MSE(sim(new_clip_image, new_clip_text), sim(original_clip_image, original_clip_text))
This could be used completely instead of the alignment loss, or in addition to it.
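A sketch of that similarity-matching term, assuming "new clip" is the student being trained and "original clip" is the frozen teacher (the identifiers are hypothetical):

```python
import torch
import torch.nn.functional as F

def similarity_mse_loss(student_img, student_txt,
                        teacher_img, teacher_txt):
    # Match the student's image-text similarity matrix to the teacher's,
    # instead of (or in addition to) matching embeddings directly.
    student_sim = (F.normalize(student_img, dim=-1) @
                   F.normalize(student_txt, dim=-1).t())
    with torch.no_grad():
        teacher_sim = (F.normalize(teacher_img, dim=-1) @
                       F.normalize(teacher_txt, dim=-1).t())
    return F.mse_loss(student_sim, teacher_sim)
```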