princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License

Question about bad results from trained model #240

Closed · Alison-starbeat closed this issue 1 year ago

Alison-starbeat commented 1 year ago

Sorry to bother you! I'm new to NLP. I tried to train unsupervised SimCSE on my own data, with the goal of achieving the best recall and precision scores on my own test dataset. I trained on 10,000 to 90,000 examples for 1-2 epochs, with a learning rate of 1e-5 and a batch size of 64, starting from a base model (the Chinese version of roformer-sim). But I found that the results from the trained model were worse than those of the base model.
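
For reference, my comparison of the two checkpoints looks roughly like the minimal sketch below. It assumes both checkpoints can be loaded with the `simcse` tool from this repo; `queries`, `gold_docs`, and `corpus` are placeholders for my own test data.

```python
# Minimal sketch: compare base vs. trained checkpoint by recall@k on a
# retrieval test set, assuming the checkpoints load with the simcse tool.
from simcse import SimCSE

def recall_at_k(model_path, queries, gold_docs, corpus, k=10):
    """Fraction of queries whose gold sentence appears in the top-k results."""
    model = SimCSE(model_path)
    model.build_index(corpus)
    hits = 0
    for query, gold in zip(queries, gold_docs):
        # threshold=0 so that low-scoring results are not filtered out
        results = model.search(query, threshold=0, top_k=k)
        if gold in {sentence for sentence, score in results}:
            hits += 1
    return hits / len(queries)

# e.g. recall_at_k("path/to/base-model", ...)     # placeholder paths
#  vs. recall_at_k("path/to/trained-model", ...)
```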

My guess is that the problem lies in my dataset: it may naturally contain many similar sentence pairs, and since unsupervised SimCSE treats the other sentences in a batch as negatives, such pairs would act as false negatives and hurt the contrastive learning step. Could this be true? What could I do to improve the results?
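
If that hypothesis is right, one thing I'm considering is filtering near-duplicate lines out of the training file before training. A rough sketch of my own preprocessing idea (not part of the SimCSE codebase; file paths are placeholders):

```python
# Rough sketch: drop near-duplicate training lines so that in-batch
# negatives are less likely to be semantically identical "false negatives".

def char_ngrams(text, n=3):
    text = "".join(text.split())  # ignore whitespace; suits Chinese text
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def dedupe(lines, threshold=0.8):
    """Keep a line only if its character 3-gram Jaccard similarity to every
    previously kept line is below `threshold`. O(n^2); fine as a sketch for
    small corpora, but larger ones would need MinHash/LSH."""
    kept, kept_grams = [], []
    for line in lines:
        grams = char_ngrams(line)
        if not grams:
            continue
        is_dup = False
        for g in kept_grams:
            inter = len(grams & g)
            if inter and inter / len(grams | g) >= threshold:
                is_dup = True
                break
        if not is_dup:
            kept.append(line)
            kept_grams.append(grams)
    return kept

with open("train.txt", encoding="utf-8") as f:  # placeholder path
    lines = [l.strip() for l in f if l.strip()]
with open("train_dedup.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(dedupe(lines)))
```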

Thank you for your patience and hope for your reply!

gaotianyu1350 commented 1 year ago

Hi, can you elaborate more on the issue? For example, what this dataset is about, what the baseline model is, etc.

github-actions[bot] commented 1 year ago

Stale issue message