princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License

When adding do_mlm #174

Closed Jason-kid closed 2 years ago

Jason-kid commented 2 years ago

When I add the do_mlm flag, why does the cosine similarity become larger even when two sentences are unrelated?

gaotianyu1350 commented 2 years ago

Hi,

MLM training is known for causing "representation degeneration", which means the embeddings of different words/sentences become similar to one another.
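
A rough way to see this effect is to compare the average pairwise cosine similarity over a set of embeddings: the closer it is to 1, the more the embeddings have collapsed toward one direction. The snippet below is only a minimal sketch with made-up toy vectors, not output from SimCSE, and the helper names `cosine` and `mean_pairwise_cosine` are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy vectors, invented for illustration only (not real model outputs).
spread_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # point in different directions
collapsed = [[1.0, 0.1], [1.0, 0.0], [1.0, -0.1]]    # all point almost the same way

print("spread out:", mean_pairwise_cosine(spread_out))
print("collapsed: ", mean_pairwise_cosine(collapsed))
```

With real models, the same comparison could be run on sentence embeddings of unrelated sentences from a checkpoint trained with and without do_mlm.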

Jason-kid commented 2 years ago

Thank you for your kind reply! I have another question. I have a dataset in which each line consists of two sentences, with no labels indicating whether they are similar or not. For this task, I am trying to train the model in the unsupervised way, grouping sentA and sentB into one sentence for the unsupervised training. Do you have any advice? Thanks a lot.
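
For concreteness, the grouping I describe looks roughly like the sketch below. The tab separator between the two sentences, the space used to join them, and the function name `pairs_to_training_lines` are all assumptions made for illustration; my actual file may use a different layout.

```python
def pairs_to_training_lines(raw_lines, sep="\t", joiner=" "):
    """Merge each (sentA, sentB) pair into a single line of text.

    Assumes every input line holds exactly two sentences separated by `sep`.
    Empty lines and lines without exactly two parts are skipped.
    """
    merged = []
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue
        parts = line.split(sep)
        if len(parts) != 2:
            continue  # malformed line: not exactly two sentences
        sent_a, sent_b = (p.strip() for p in parts)
        merged.append(f"{sent_a}{joiner}{sent_b}")
    return merged


# Made-up example input, one pair per line.
example = ["A cat sits on the mat.\tA dog runs in the park.\n"]
for text in pairs_to_training_lines(example):
    print(text)
```

Each returned string would then be written as one line of the text file used for unsupervised training.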