princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License

Supervised SimCSE for datasets with different degrees of similarity. #205

Closed · EeyoreLee closed this issue 1 year ago

EeyoreLee commented 1 year ago
Hi, thanks for your efficient work. I have a question: what should I do if the dataset has different degrees of similarity, like this?

sentence1  sentence2  similarity
A1         A2         0.8
A1         A3         0.7
B1         B2         0.82
B1         B3         0.7

I saw there is a hard negative weight in SimCSE. Should I use it to give low-similarity pairs a heavier penalty, or should I change the loss, e.g. something like Focal Loss (but not exactly it), to give high-similarity pairs a higher weight? Looking forward to your reply and your understanding of the matter. Thanks in advance! :)

gaotianyu1350 commented 1 year ago

Hi,

For now our framework does not support real-number similarities as supervision. You can either adjust the code to set different weights on different training examples, or truncate the similarities and treat only pairs with high-enough similarity as positive pairs.
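
Both options can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration, not code from the SimCSE repo: `weighted_infonce` weights each positive pair's contribution to the in-batch contrastive loss by its similarity label, and `truncate_pairs` implements the thresholding alternative. The function names, the temperature value, and the 0.75 threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_infonce(z1, z2, pair_weights, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss where each positive pair carries
    its own weight, e.g. derived from a real-valued similarity label.

    z1, z2: (batch, dim) embeddings of the paired sentences.
    pair_weights: (batch,) per-pair weights, e.g. the similarity scores.
    NOTE: hypothetical sketch, not the actual SimCSE training code.
    """
    # (batch, batch) cosine-similarity matrix; diagonal entries are positives.
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    labels = torch.arange(sim.size(0), device=sim.device)
    # Per-example cross-entropy, then weight each pair before averaging.
    per_example = F.cross_entropy(sim, labels, reduction="none")
    return (pair_weights * per_example).sum() / pair_weights.sum()

def truncate_pairs(pairs, threshold=0.75):
    """Thresholding alternative: keep only high-similarity pairs as positives.
    pairs: iterable of (sent1, sent2, similarity) triples."""
    return [(s1, s2) for s1, s2, sim in pairs if sim >= threshold]
```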

EeyoreLee commented 1 year ago

Thanks for your reply. In fact, I want to sort a set of sentences by their similarity to another text, so the real-number similarities seem important. Therefore, "truncate the similarities and treat only pairs with high-enough similarity as positive pairs" may not be a suitable approach. Given that, do you agree that I shouldn't use the hard negative weight and should instead adjust the code to support this idea?

gaotianyu1350 commented 1 year ago

Hi,

Yeah, hard negatives won't suit your needs. Contrastive learning may not be a good objective in your case; you could probably use a regression objective instead.
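
For reference, a regression objective here could be as simple as pushing the cosine similarity of each embedding pair toward its gold score. The sketch below is a hypothetical illustration in PyTorch, not part of the SimCSE codebase; the function name and the use of plain MSE are assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_regression_loss(z1, z2, gold_scores):
    """Regression objective: push the cosine similarity of each embedding
    pair toward its real-valued gold similarity label (e.g. 0.8, 0.7, ...).

    z1, z2: (batch, dim) embeddings from any sentence encoder.
    gold_scores: (batch,) float tensor of gold similarities.
    NOTE: illustrative sketch, not SimCSE's actual objective.
    """
    pred = F.cosine_similarity(z1, z2, dim=-1)  # (batch,)
    return F.mse_loss(pred, gold_scores)
```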

EeyoreLee commented 1 year ago

Hi,

> Yeah, hard negatives won't suit your needs. Contrastive learning may not be a good objective in your case; you could probably use a regression objective instead.

Thanks. I will try some approaches and report back here if I adjust SimCSE and it turns out to be useful.