Open vmmm123 opened 2 years ago
Hi, thanks for your question!
The loss function tries to increase the similarity between an embedding v and its positive prototype c, which is v · c / φ. When φ is larger, v · c also needs to be larger to reach the same similarity. Therefore, the embedding is pulled closer to the prototype.
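To make this concrete, here is a minimal NumPy sketch of a prototype-level contrastive loss with a per-prototype concentration φ. It is an illustration under assumptions, not the repository's implementation: the function name `proto_nce_loss`, the toy vectors, and the φ values are all made up for the example.

```python
import numpy as np

def proto_nce_loss(v, prototypes, phi, pos):
    """Cross-entropy over prototype similarities v . c_j / phi_j (hypothetical helper)."""
    logits = prototypes @ v / phi             # each prototype is scaled by its own phi
    logits = logits - logits.max()            # subtract the max for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum())
    return -log_prob[pos]

# Toy setup (assumed values): one embedding and three unit-norm prototypes.
v = np.array([1.0, 0.0])
prototypes = np.array([[0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
pos = 0  # index of the positive prototype

# Same embedding and same dot products; only phi of the positive cluster changes.
phi_tight = np.array([0.1, 0.2, 0.2])  # positive cluster is tight (small phi)
phi_loose = np.array([0.4, 0.2, 0.2])  # positive cluster is loose (large phi)

loss_tight = proto_nce_loss(v, prototypes, phi_tight, pos)
loss_loose = proto_nce_loss(v, prototypes, phi_loose, pos)
print(f"tight: {loss_tight:.4f}  loose: {loss_loose:.4f}")
```

With the same v · c, the loose cluster still has a noticeably higher loss, so the optimizer has more to gain there by increasing v · c, i.e. by moving v toward c.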
OK, that is the intuitive explanation. I tried to understand it from the gradient perspective, and I am afraid that the larger gradient may force the model to focus more on the tight clusters, where φ is smaller.
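For reference, the gradient in question can be written out. This is a sketch under simplifying assumptions: the prototypes c_j and concentrations φ_j are treated as constants, the normalization of v is ignored, and s denotes the index of the positive prototype (notation introduced here, not taken from the thread).

```latex
\mathcal{L}(v) = -\log \frac{\exp(v \cdot c_s / \phi_s)}{\sum_j \exp(v \cdot c_j / \phi_j)},
\qquad
p_j = \frac{\exp(v \cdot c_j / \phi_j)}{\sum_k \exp(v \cdot c_k / \phi_k)}

\frac{\partial \mathcal{L}}{\partial v}
  = -\frac{1 - p_s}{\phi_s}\, c_s + \sum_{j \neq s} \frac{p_j}{\phi_j}\, c_j
```

The pull toward the positive prototype is weighted by (1 - p_s) / φ_s. A smaller φ_s does increase the 1/φ_s factor, but it also drives p_s toward 1 sooner, which shrinks (1 - p_s), so the weight is not determined by φ_s alone.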
In the paper, you mention "With the proposed φ, the similarity in a loose cluster (larger φ) are down-scaled, pulling embeddings closer to the prototype", but I am wondering why the down-scaled similarity forces them to get closer. Could you please explain it in more detail? Thanks!