princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License

Failed to reproduce the results in unsupervised settings #82

Closed tianylin98 closed 3 years ago

tianylin98 commented 3 years ago

Hi, I tried running the script in this repo with the following command:

export CUDA_VISIBLE_DEVICES="1" && bash run_unsup_example.sh

After the model is trained, the automatic evaluation shows the following results:

08/29/2021 08:29:05 - INFO - root -   Evaluating...
08/29/2021 08:29:05 - INFO - __main__ -   ***** Eval results *****
08/29/2021 08:29:05 - INFO - __main__ -     epoch = 1.0
08/29/2021 08:29:05 - INFO - __main__ -     eval_CR = 86.98
08/29/2021 08:29:05 - INFO - __main__ -     eval_MPQA = 88.42
08/29/2021 08:29:05 - INFO - __main__ -     eval_MR = 83.79
08/29/2021 08:29:05 - INFO - __main__ -     eval_MRPC = 72.91
08/29/2021 08:29:05 - INFO - __main__ -     eval_SST2 = 86.47
08/29/2021 08:29:05 - INFO - __main__ -     eval_SUBJ = 99.65
08/29/2021 08:29:05 - INFO - __main__ -     eval_TREC = 81.81
08/29/2021 08:29:05 - INFO - __main__ -     eval_avg_sts = 0.7749235656339977
08/29/2021 08:29:05 - INFO - __main__ -     eval_avg_transfer = 85.71857142857144
08/29/2021 08:29:05 - INFO - __main__ -     eval_sickr_spearman = 0.7408230149520894
08/29/2021 08:29:05 - INFO - __main__ -     eval_stsb_spearman = 0.809024116315906
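
As a quick sanity check (mine, not part of the repo's scripts), the two averages in the log are consistent with the individual scores reported above:

```python
# Verify the averages in the eval log against the individual scores.
# Values are copied from the log above; this is only a consistency check.
transfer = [86.98, 88.42, 83.79, 72.91, 86.47, 99.65, 81.81]  # CR..TREC
sts = [0.7408230149520894, 0.809024116315906]  # SICK-R, STS-B Spearman

avg_transfer = sum(transfer) / len(transfer)
avg_sts = sum(sts) / len(sts)

print(round(avg_transfer, 4))  # 85.7186, matching eval_avg_transfer
print(round(avg_sts, 4))       # 0.7749, matching eval_avg_sts
```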

I believe the results so far look pretty normal.

But when I evaluate the checkpoint using the recommended command,

python evaluation.py \
    --model_name_or_path result/my-unsup-simcse-bert-base-uncased \
    --pooler cls \
    --task_set sts \
    --mode test

the numbers are noticeably lower than expected:

***** Transfer task : STSBenchmark*****

2021-08-29 08:30:54,720 : train : pearson = 0.4718, spearman = 0.5396
2021-08-29 08:30:56,033 : dev : pearson = 0.4987, spearman = 0.6017
2021-08-29 08:30:57,143 : test : pearson = 0.4298, spearman = 0.5634
2021-08-29 08:30:57,152 : ALL : Pearson = 0.4729, Spearman = 0.5616
2021-08-29 08:30:57,152 : ALL (weighted average) : Pearson = 0.4698, Spearman = 0.5542
2021-08-29 08:30:57,152 : ALL (average) : Pearson = 0.4668, Spearman = 0.5682

2021-08-29 08:30:57,169 :

***** Transfer task : SICKRelatedness*****

2021-08-29 08:31:00,348 : train : pearson = 0.6276, spearman = 0.6400
2021-08-29 08:31:00,734 : dev : pearson = 0.6166, spearman = 0.6831
2021-08-29 08:31:04,107 : test : pearson = 0.6256, spearman = 0.6392
2021-08-29 08:31:04,116 : ALL : Pearson = 0.6258, Spearman = 0.6419
2021-08-29 08:31:04,116 : ALL (weighted average) : Pearson = 0.6260, Spearman = 0.6418
2021-08-29 08:31:04,116 : ALL (average) : Pearson = 0.6232, Spearman = 0.6541

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 50.73 | 62.48 | 52.55 | 60.54 | 60.87 |    56.34     |      63.92      | 58.20 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

For comparison, I also ran the evaluation on the huggingface-hosted checkpoint:

+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 66.04 | 81.49 | 73.61 | 79.73 | 78.12 |    76.52     |      71.86      | 75.34 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

Did I do something wrong / misunderstand something here?

gaotianyu1350 commented 3 years ago

Hi,

For the unsupervised model, you should use --pooler cls_before_pooler for the evaluation.
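
For readers hitting the same issue: as I understand the evaluation script, --pooler cls runs the [CLS] hidden state through BERT's extra Linear+tanh pooler head, while --pooler cls_before_pooler uses the raw [CLS] hidden state. The difference can be sketched with toy tensors (numpy stand-ins; shapes and weights below are illustrative, not the repo's actual code):

```python
import numpy as np

# Toy stand-ins for a transformer's outputs; shapes and weights are
# illustrative only, not SimCSE's actual implementation.
rng = np.random.default_rng(0)
batch, seq_len, hidden = 2, 4, 8
last_hidden_state = rng.normal(size=(batch, seq_len, hidden))

# "--pooler cls_before_pooler": the raw [CLS] hidden state (position 0).
cls_before_pooler = last_hidden_state[:, 0]

# "--pooler cls": the [CLS] state passed through an extra Linear+tanh head.
W = rng.normal(size=(hidden, hidden))
b = rng.normal(size=hidden)
cls = np.tanh(cls_before_pooler @ W + b)

# The two settings yield different sentence embeddings, which explains the
# score gap when the wrong pooler is used at evaluation time.
print(cls_before_pooler.shape, cls.shape)
```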

tianylin98 commented 3 years ago

Thanks for the reply. The numbers look normal now.

korawat-tanwisuth commented 3 years ago

@boredtylin

> Thanks for the reply. The numbers look normal now.

Do you mind letting me know what numbers you got?

tianylin98 commented 3 years ago

> @boredtylin
>
> Thanks for the reply. The numbers look normal now.
>
> Do you mind letting me know what numbers you got?

For example:

+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 70.06 | 81.01 | 75.25 | 82.06 | 76.53 |    77.01     |      71.24      | 76.17 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
korawat-tanwisuth commented 3 years ago

@boredtylin I used the command they provided. I set CUDA_VISIBLE_DEVICES=0 when running the bash command. I also changed the pooler option to --pooler_before_cls. Do you know what else we need to change to get the result? I appreciate your help.

tianylin98 commented 3 years ago

> @boredtylin I used the command they provided. I set CUDA_VISIBLE_DEVICES=0 when running the bash command. I also changed the pooler option to --pooler_before_cls. Do you know what else we need to change to get the result? I appreciate your help.

I think your results are normal. The evaluation does show some run-to-run variation. Even with the same hyper-parameters, I sometimes get results lower than yours (e.g., Avg = 73). I suspect the authors only report the result for the best random seed.

You might want to try varying the random seed by setting --seed in the training script (e.g., run_unsup_example.sh).
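
If you do sweep seeds, a small helper can summarize the spread across runs. The seed keys and the two Avg. scores below are only illustrative (the 76.17 and roughly-73 values mentioned in this thread):

```python
import statistics

# Avg. STS score per training seed; values are illustrative, taken from
# the two runs mentioned in this thread, and the seed keys are made up.
avgs_by_seed = {42: 76.17, 7: 73.0}

mean = statistics.mean(avgs_by_seed.values())
spread = statistics.stdev(avgs_by_seed.values())
print(f"mean={mean:.3f}, stdev={spread:.3f}")
```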