Closed by MrRace 2 years ago
Hi,
Can you provide more details? For example, what are the data and what's the gap between the performance of the two methods?
@gaotianyu1350 Thanks for your reply. My training data are pairs of similar texts, for example:
好友早上好祝你开心每一天 老乡早上好开心每一天
早上好谢谢亲新年快乐 早上好新年快乐真好听
亲亲多谢支持 感谢冰冰支持美评
多谢姐姐厚礼鼓励和支持 感谢妹妹支持鼓励
姐早上好 姐夫早上好
做好自己就行 做最好的自己
I use 244579 pairs to train supervised SimCSE. The unsupervised SimCSE is trained on 188927 lines of text like:
去年的今天歌曲感谢大家聆听支持谢谢
一首老歌包含深情厚意来听听吧谢谢
The gap is huge. I tested both the unsupervised and the supervised SimCSE on 2252 pairs.
unsupervised SimCSE:
FN(false negative)=357
FP(false positive)=12
supervised SimCSE:
FN(false negative)=70
FP(false positive)=550
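To put the two error profiles side by side, here is a rough sketch of the arithmetic. It assumes each of the 2252 test pairs is classified exactly once, so FN + FP is the total number of mistakes. `error_summary` is a hypothetical helper written for this comparison, not part of SimCSE.

```python
def error_summary(fn: int, fp: int, total: int) -> dict:
    """Summarize misclassifications, assuming each test pair is counted once."""
    errors = fn + fp
    return {
        "errors": errors,
        "error_rate": errors / total,
        "accuracy": 1 - errors / total,
    }


TOTAL_PAIRS = 2252  # size of the test set reported above

unsup = error_summary(fn=357, fp=12, total=TOTAL_PAIRS)
sup = error_summary(fn=70, fp=550, total=TOTAL_PAIRS)

for name, summary in (("unsupervised", unsup), ("supervised", sup)):
    print(f"{name}: errors={summary['errors']}, accuracy={summary['accuracy']:.3f}")
```

Under that assumption the unsupervised model makes 369 errors (about 83.6% accuracy) and the supervised one makes 620 (about 72.5%). The errors also flip direction: mostly false negatives for the unsupervised model, mostly false positives for the supervised one.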
Can you check whether the format is correct? It should be valid csv; you can use our provided data as a reference.
According to the code:
    extension = data_args.train_file.split(".")[-1]
    if extension == "txt":
        extension = "text"
    if extension == "csv":
        logger.info("data_files={}".format(data_files))
        datasets = load_dataset(extension, data_files=data_files, cache_dir="./data/", delimiter="\t" if "tsv" in data_args.train_file else ",")
    else:
        datasets = load_dataset(extension, data_files=data_files, cache_dir="./data/")
Therefore my data format looks like this:
开心快乐每一天 快乐美一天
开心快乐美一天 快快乐乐每一天
The filename is whole.tsv.pre.csv and the separator between the two texts in each pair is \t. Does that look OK?
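A small sketch that mirrors only the quoted branch logic can show which loader name and delimiter a given filename would end up with. `resolve_loader` is a hypothetical stand-in for illustration; it does not call `load_dataset`. `pairs.csv` and `pairs.tsv` are made-up filenames.

```python
def resolve_loader(train_file: str) -> tuple:
    """Mirror the quoted logic: return (loader_name, delimiter or None)."""
    extension = train_file.split(".")[-1]
    if extension == "txt":
        extension = "text"
    if extension == "csv":
        # The delimiter depends on whether "tsv" appears anywhere in the name.
        delimiter = "\t" if "tsv" in train_file else ","
        return extension, delimiter
    # Any other extension is passed through without a delimiter argument.
    return extension, None


for name in ("whole.tsv.pre.csv", "pairs.csv", "pairs.tsv"):
    print(name, "->", resolve_loader(name))
```

With this logic, `whole.tsv.pre.csv` resolves to the `csv` loader with a tab delimiter, because the last extension is `csv` and the substring `tsv` appears in the name.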
Hi,
If it is csv, the delimiter should be ","; in your case the extension name should be "tsv".
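One way to check the file against the delimiter is to count the fields per row. This is a minimal sketch using Python's standard `csv` module rather than the SimCSE loader. `column_counts` is a hypothetical helper, and the temporary file stands in for the real training data.

```python
import csv
import os
import tempfile
from collections import Counter


def column_counts(path: str, delimiter: str) -> Counter:
    """Count how many rows have each number of fields under the given delimiter."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        for row in reader:
            counts[len(row)] += 1
    return counts


# Demo data: two tab-separated pairs written to a throwaway file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8", newline=""
) as tmp:
    tmp.write("text a\ttext b\nmore text\tmore text again\n")
    demo_path = tmp.name

tab_counts = column_counts(demo_path, "\t")
comma_counts = column_counts(demo_path, ",")
os.remove(demo_path)

print("tab:", dict(tab_counts))
print("comma:", dict(comma_counts))
```

For a two-column pair file, every row should have exactly two fields under the delimiter the loader uses. If rows show one field, the file and the delimiter do not match.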
I trained supervised SimCSE on my own Chinese corpus, which has only two columns: each line is a text pair with no hard negative. I used a script like run_sup_example.sh, but the results look worse than my Chinese unsupervised SimCSE on a text-pair similarity task. This is not what I expected. How can I improve it? Thanks a lot!