Maybe part of the problem with continuing from the distilled checkpoint is that the final task doesn't allow much deviation from the L weights. To allow more freedom, we could use an alignment-style distillation loss in the lower layers of the network and only apply MSE(similarities) at the final layer.
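A minimal sketch of what that combined loss could look like, assuming a PyTorch setup where both models expose per-layer hidden states of matching shape (otherwise a learned projection per layer would be needed). "Alignment-style" is interpreted here as a cosine alignment on intermediate activations, and "MSE(similarities)" as an MSE between in-batch similarity matrices; all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_hiddens, teacher_hiddens, alpha=1.0, beta=1.0):
    """Alignment loss on lower layers + MSE(similarities) at the final layer.

    student_hiddens / teacher_hiddens: lists of [batch, dim] tensors, one
    per layer, with the final-layer embedding last.
    """
    # Alignment-style loss on lower layers: pull student hidden states
    # toward the teacher's in direction only (cosine), which constrains
    # the student less than an exact MSE on raw activations would.
    align = 0.0
    for s, t in zip(student_hiddens[:-1], teacher_hiddens[:-1]):
        align = align + (1.0 - F.cosine_similarity(s, t.detach(), dim=-1)).mean()
    align = align / max(len(student_hiddens) - 1, 1)

    # Final layer: match the in-batch similarity structure rather than the
    # raw embeddings, so the student is free to rotate/rescale its space.
    s_emb = F.normalize(student_hiddens[-1], dim=-1)
    t_emb = F.normalize(teacher_hiddens[-1], dim=-1)
    sim_loss = F.mse_loss(s_emb @ s_emb.T, (t_emb @ t_emb.T).detach())

    return alpha * align + beta * sim_loss
```

Detaching the teacher tensors keeps gradients from flowing into the L weights; the alpha/beta weighting would control how much freedom the lower layers actually get relative to the final-layer similarity constraint.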