princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License
3.39k stars, 512 forks

Training supervised SimCSE on a corpus of pair data with no hard negatives #139

Closed: MrRace closed this issue 2 years ago

MrRace commented 2 years ago

I trained supervised SimCSE on my own Chinese corpus, which has only two columns: each line is a text pair with no hard negative. I used a script like run_sup_example.sh, but the result looks worse than my Chinese unsupervised SimCSE on a text-pair similarity task. This is not what I expected. How can I improve it? Thanks a lot!

gaotianyu1350 commented 2 years ago

Hi,

Can you provide more details? For example, what are the data and what's the gap between the performance of the two methods?

MrRace commented 2 years ago

@gaotianyu1350 Thanks for your reply. My training data are pairs of similar texts, like:

好友早上好祝你开心每一天        老乡早上好开心每一天
早上好谢谢亲新年快乐    早上好新年快乐真好听
亲亲多谢支持    感谢冰冰支持美评
多谢姐姐厚礼鼓励和支持  感谢妹妹支持鼓励
姐早上好        姐夫早上好
做好自己就行    做最好的自己

I used 244579 pairs to train supervised SimCSE. The unsupervised SimCSE was trained on 188927 lines of text like:

去年的今天歌曲感谢大家聆听支持谢谢
一首老歌包含深情厚意来听听吧谢谢

The gap is huge. I used 2252 pairs to test both the unsupervised and the supervised SimCSE.

Unsupervised SimCSE:

FN(false negative)=357
FP(false positive)=12

Supervised SimCSE:

FN(false negative)=70
FP(false positive)=550
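
To put the two sets of counts on a common scale, here is a minimal sketch that sums the errors and divides by the 2252 test pairs. The names `TOTAL_PAIRS`, `reported`, and `error_summary` are made up for illustration. Only the FN/FP counts and the test-set size come from the comment above; the similarity threshold used to produce them is not stated, so the sketch says nothing about it.

```python
from typing import Tuple

# Counts copied from the comment above; the test set has 2252 pairs.
TOTAL_PAIRS = 2252
reported = {
    "unsupervised": {"fn": 357, "fp": 12},
    "supervised": {"fn": 70, "fp": 550},
}


def error_summary(fn: int, fp: int, total: int) -> Tuple[int, float]:
    """Return (number of misclassified pairs, error rate)."""
    errors = fn + fp
    return errors, errors / total


for name, counts in reported.items():
    errors, rate = error_summary(counts["fn"], counts["fp"], TOTAL_PAIRS)
    print(f"{name}: {errors} errors, error rate {rate:.1%}")
# unsupervised: 369 errors, error rate 16.4%
# supervised: 620 errors, error rate 27.5%
```

The supervised model makes fewer false negatives but many more false positives, so the overall error count goes up.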

gaotianyu1350 commented 2 years ago

Can you check whether the format is correct? It should be valid csv; you can use our provided data as a reference.

MrRace commented 2 years ago

According to the code:

    extension = data_args.train_file.split(".")[-1]
    if extension == "txt":
        extension = "text"
    if extension == "csv":
        logger.info("data_files={}".format(data_files))
        datasets = load_dataset(extension, data_files=data_files, cache_dir="./data/", delimiter="\t" if "tsv" in data_args.train_file else ",")
    else:
        datasets = load_dataset(extension, data_files=data_files, cache_dir="./data/")

So my data format looks like:

开心快乐每一天  快乐美一天
开心快乐美一天  快快乐乐每一天

The filename is whole.tsv.pre.csv and the separator between the two texts in each pair is \t. Does that look OK?
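
For reference, the branching in the snippet quoted above can be isolated into a small pure function to see which loader name and delimiter it would select for a given filename. This is only a sketch of the quoted lines, assuming they are what actually runs; `select_loader` is a hypothetical helper, not a function in the repository, and it does not call `load_dataset`.

```python
from typing import Optional, Tuple


def select_loader(train_file: str) -> Tuple[str, Optional[str]]:
    """Mirror the quoted branching: return (loader name, delimiter or None)."""
    extension = train_file.split(".")[-1]
    if extension == "txt":
        extension = "text"
    if extension == "csv":
        # The quoted code looks for the substring "tsv" anywhere in the path.
        delimiter = "\t" if "tsv" in train_file else ","
        return extension, delimiter
    # Any other extension is passed on without an explicit delimiter.
    return extension, None


print(select_loader("whole.tsv.pre.csv"))  # ('csv', '\t')
print(select_loader("data.csv"))           # ('csv', ',')
print(select_loader("data.txt"))           # ('text', None)
```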

gaotianyu1350 commented 2 years ago

Hi,

If it is csv, the delimiter should be ","; in your case the extension name should be "tsv".
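
A quick way to see why the delimiter matters is to parse one line of the data with each candidate delimiter and count the resulting columns. The sketch below uses Python's standard `csv` module as a stand-in for the dataset loader, which is an assumption for illustration; `count_columns` is a hypothetical helper, and the sample line is taken from the comment above.

```python
import csv
import io


def count_columns(line: str, delimiter: str) -> int:
    """Parse a single line with the given delimiter and return the column count."""
    reader = csv.reader(io.StringIO(line), delimiter=delimiter)
    return len(next(reader))


# One tab-separated pair from the thread.
sample = "开心快乐每一天\t快乐美一天"

print(count_columns(sample, ","))   # 1: the whole line is read as a single column
print(count_columns(sample, "\t"))  # 2: the pair is split into two sentences
```

If a check like this reports one column per line, the two sentences of each pair are not being separated, and the supervised objective would not see them as a pair.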