princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License

How do I use SimCSE on my own dataset? #189

Closed: skye95git closed this issue 1 year ago

skye95git commented 2 years ago

I'm working on a search task, and the pre-trained model I'm using is RoBERTa-base. I would like to add SimCSE on top of it. How can I use SimCSE on my own dataset?

gaotianyu1350 commented 2 years ago

Hi,

If you want to train SimCSE on your own dataset, you can simply replace our training data with your own data in the same format. We already provide an example script in the README.
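For readers preparing their own data, here is a minimal sketch of the "same format" step. It assumes the unsupervised setting, where the training file is plain text with one sentence per line; check the repo's example data to confirm the layout for your case. The helper name `write_unsup_train_file` and the file name `my_corpus.txt` are made up for illustration and are not part of the repo.

```python
from pathlib import Path


def write_unsup_train_file(sentences, out_path):
    """Write one cleaned sentence per line and return the number of lines written.

    Hypothetical helper, not part of the SimCSE repo.
    """
    cleaned = []
    for s in sentences:
        # Collapse newlines and repeated whitespace so each example stays on one line.
        s = " ".join(s.split())
        if s:  # skip empty or whitespace-only entries
            cleaned.append(s)
    text = "".join(line + "\n" for line in cleaned)
    Path(out_path).write_text(text, encoding="utf-8")
    return len(cleaned)


if __name__ == "__main__":
    # Toy corpus standing in for your own search queries or documents.
    docs = [
        "how to reset a forgotten password",
        "   ",
        "steps to recover\nan account",
    ]
    n = write_unsup_train_file(docs, "my_corpus.txt")
    print(f"wrote {n} sentences")  # wrote 2 sentences
```

The resulting file would then be passed to the training script through `--train_file`. For supervised training, the repo's example data uses sentence pairs (optionally with a hard negative) rather than single sentences, so check the README for the exact layout before adapting this.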

adhb22 commented 1 year ago

Hi, I guess you mean we can prepare our data and use the shell script to train our own model. But I wonder how to use the installed module (pip install simcse) to train our own model.

gaotianyu1350 commented 1 year ago

Hi,

The pip package cannot be used to train your own model. To do that, you need to use this GitHub repo and follow the README.

TomasAndersonFang commented 1 year ago

Hi, can I train and evaluate SimCSE on my own datasets? I can train it on my dataset by setting "--train_file", but I don't know how to evaluate SimCSE on my own test set. According to your source code, it seems that SimCSE can only be evaluated on some specific tasks.

gaotianyu1350 commented 1 year ago

Hi,

We use a modified version of SentEval for evaluation. To evaluate on your own file, you can modify the SentEval part of the code. If you want a HIT@N (retrieval-style) evaluation, you will have to implement your own evaluation protocol. This repo might be helpful for retrieval-style evaluation: https://github.com/castorini/pyserini
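For anyone implementing this themselves, a HIT@N protocol can be sketched in a few lines. This is not the repo's SentEval evaluation and not pyserini. It is a small pure-Python illustration that assumes you have already encoded your queries and documents into vectors with your trained model, and that each query has exactly one gold document. The function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hits_at_n(query_vecs, doc_vecs, gold_doc_ids, n=10):
    """Fraction of queries whose gold document index appears in the top-n ranked documents."""
    if not query_vecs:
        return 0.0
    hits = 0
    for q, gold in zip(query_vecs, gold_doc_ids):
        # Rank all document indices by similarity to the query, best first.
        ranked = sorted(
            range(len(doc_vecs)),
            key=lambda i: cosine(q, doc_vecs[i]),
            reverse=True,
        )
        if gold in ranked[:n]:
            hits += 1
    return hits / len(query_vecs)


if __name__ == "__main__":
    # Toy 2-d "embeddings"; in practice these come from your trained encoder.
    docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
    queries = [[0.9, 0.1], [0.1, 0.9]]
    gold = [0, 2]  # index of the relevant document for each query
    print(hits_at_n(queries, docs, gold, n=1))  # 0.5
    print(hits_at_n(queries, docs, gold, n=2))  # 1.0
```

Exhaustive ranking like this is fine for a small test set. For a large document collection, a dedicated retrieval toolkit or an approximate nearest-neighbor index is the more practical choice.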