princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License

Why does my train result differ a lot from evaluation? #203

Closed: ilingen closed this issue 1 year ago

ilingen commented 1 year ago

When I train my SimCSE model using run_unsup_example.sh, I get the following best result:

epoch = 1.0
eval_CR = 86.33
eval_MPQA = 88.02
eval_MR = 80.84
eval_MRPC = 72.52
eval_SST2 = 86.47
eval_SUBJ = 94.46
eval_TREC = 83.77
eval_avg_sts = 0.7706307325166646
eval_avg_transfer = 84.63
eval_sickr_spearman = 0.7318779427757625
eval_stsb_spearman = 0.8093835222575666

But when I run evaluation.py to test my trained model, its eval_avg_sts score is only 0.7490, a gap of about 2 points, which seems like an unacceptable loss. BTW, I also ran

python evaluation.py --model_name_or_path result/unsup-concse-bert-base-uncased --pooler cls_before_pooler --task_set sts 

and got an eval_avg_sts score of 76.25, the same as reported in the paper. So I wonder why my training result and evaluation result differ so much.

gaotianyu1350 commented 1 year ago

Hi,

Please make sure you have converted the model format following the README's instructions before you use evaluation.py.
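
For reference, the conversion step should look roughly like this (script name and --path flag as I recall them from the README, with the checkpoint directory from your command above substituted in; please double-check against the README):

# Convert the raw SimCSE training checkpoint into standard HuggingFace format
# (script name and --path flag recalled from the README; verify there)
python simcse_to_huggingface.py --path result/unsup-concse-bert-base-uncased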

ilingen commented 1 year ago

Hi, I got a 75.7% Spearman correlation after setting different random seeds. I checked my previous results and found that the random seed has a large impact on performance. I am going to close this issue. Thanks for your reply.
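
For anyone hitting the same gap, this is roughly how I reran training across seeds (a sketch: it assumes your copy of run_unsup_example.sh forwards extra arguments ("$@") to train.py, so that --seed reaches HuggingFace's TrainingArguments; the seed values and output directory names below are just examples):

# Rerun unsupervised training with several seeds to measure seed variance
# (assumes run_unsup_example.sh passes "$@" through to train.py)
for seed in 13 42 87; do
    bash run_unsup_example.sh --seed $seed --output_dir result/unsup-simcse-seed-$seed
done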