princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License

Troubles reproducing the results #100

Closed ypuzikov closed 2 years ago

ypuzikov commented 2 years ago

Hi, folks! Thank you very much for the hard work (^^) I have a question on how to reproduce the results -- not that I am aiming to spot the differences, just making sure that I am running the code correctly.

I use the run_unsup_example.sh script to train the unsupervised SimCSE model. At the end of training, I run evaluation as follows: time CUDA_VISIBLE_DEVICES=0 python evaluation.py --model_name_or_path result/my-unsup-simcse-bert-base-uncased --pooler cls --task_set sts --mode test. The results table I get is:

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 46.88 | 56.47 | 58.33 | 65.43 | 58.92 |    56.71     |      55.36      | 56.87 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

I believe the table I should be comparing against is Table 5 from the paper; the relevant row shows:

∗SimCSE-BERTbase | 68.40 | 82.41 | 74.38 | 80.91 | 78.56 | 76.85 | 72.23 | 76.25 |

This is far better than what I get. Can you help me understand whether I am doing something wrong? I follow the main README.md; the content of the run_unsup_example.sh script is:

python train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/wiki1m_for_simcse.txt \
    --output_dir result/my-unsup-simcse-bert-base-uncased \
    --num_train_epochs 1 \
    --per_device_train_batch_size 64 \
    --learning_rate 3e-5 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --eval_steps 125 \
    --pooler_type cls \
    --mlp_only_train \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --fp16 \
    "$@"
ypuzikov commented 2 years ago

After reading the closed issues, I tried converting the weights to an HF checkpoint (a rough sketch of that step is below the table). Got slightly better results, but not by much:

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 50.92 | 55.15 | 57.24 | 67.02 | 60.67 |    58.67     |      55.69      | 57.91 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
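
For reference, the conversion amounts to keeping only the encoder weights from the training output and re-saving them as a plain transformers checkpoint. A minimal sketch, assuming the training checkpoint keeps the encoder under a bert. prefix and using paths of my own choosing (this is an illustration, not the repo's official conversion script):

import torch
from transformers import AutoModel, AutoTokenizer

src = "result/my-unsup-simcse-bert-base-uncased"  # training output dir (assumed)
dst = src + "-hf"                                 # where to write the plain HF checkpoint

# Load the raw state dict written by the trainer.
state = torch.load(f"{src}/pytorch_model.bin", map_location="cpu")

# Keep only the encoder weights and strip the training-time wrapper prefix;
# the contrastive MLP head is dropped, in line with --mlp_only_train.
encoder_state = {k[len("bert."):]: v for k, v in state.items() if k.startswith("bert.")}

model = AutoModel.from_pretrained("bert-base-uncased", state_dict=encoder_state)
model.save_pretrained(dst)
AutoTokenizer.from_pretrained("bert-base-uncased").save_pretrained(dst)
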
ypuzikov commented 2 years ago

I saw issue #38, which suggests disabling the --fp16 option. I will retrain the model and report back later.

ypuzikov commented 2 years ago

Reporting back:

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 50.06 | 55.05 | 57.67 | 67.48 | 60.69 |    57.63     |      56.02      | 57.80 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

Results with the --pooler cls_before_pooler flag:

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 49.49 | 55.75 | 58.12 | 66.50 | 60.36 |    57.28     |      56.59      | 57.73 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

Environment:

gaotianyu1350 commented 2 years ago

Hi, you should use cls_before_pooler for evaluation (that is where most non-reproducible cases come from). But since you also tried that, I suggest you first evaluate our pre-trained checkpoints (to make sure nothing is wrong with the evaluation). You can also check the training log, where validation results are printed.
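
To make the distinction concrete, here is a minimal sketch of what the two pooler settings roughly correspond to when embedding a sentence with plain transformers calls (the checkpoint path is just an example, and the mapping to the evaluation.py flags is my reading of the code):

import torch
from transformers import AutoModel, AutoTokenizer

path = "princeton-nlp/unsup-simcse-bert-base-uncased"  # or your own converted checkpoint
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModel.from_pretrained(path)
model.eval()

batch = tokenizer(["A sentence to embed."], padding=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

# Roughly --pooler cls_before_pooler: the raw [CLS] hidden state, no head on top.
# This is the intended setting for unsupervised models trained with --mlp_only_train.
emb_raw_cls = out.last_hidden_state[:, 0]

# Roughly --pooler cls: the [CLS] state after the extra dense+tanh pooler layer,
# which the supervised checkpoints keep at test time.
emb_pooled_cls = out.pooler_output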

ypuzikov commented 2 years ago

Hi! Running python evaluation.py --model_name_or_path princeton-nlp/sup-simcse-bert-base-uncased --pooler cls --task_set sts --mode test gives me the following results:

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 75.30 | 84.67 | 80.19 | 85.40 | 80.82 |    84.26     |      80.39      | 81.58 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

That matches the expected result :)

By the way, running the same command but with the cls_before_pooler option gives:

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 77.75 | 81.43 | 79.53 | 85.59 | 81.98 |    83.68     |      79.52      | 81.35 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

In any case, the results from your pretrained models come out as expected, so the evaluation code works. Any hints on where the problem with the training code might be?

gaotianyu1350 commented 2 years ago

Can you check the validation results for debugging?

ypuzikov commented 2 years ago

Even compared to the test scores in Table 5, the validation results on the dev set are fairly low -- here are the highest scores I got:

{'eval_stsb_spearman': 0.6589651676962802, 'eval_sickr_spearman': 0.595184228171256, 'eval_avg_sts': 0.6270746979337681, 'epoch': 0.62}
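
In case it helps, this is roughly how I scanned the validation curve -- a small sketch assuming the trainer_state.json that the HF Trainer writes alongside checkpoints (the path is my guess for this run):

import json

# Assumed location; the HF Trainer writes trainer_state.json next to saved checkpoints.
path = "result/my-unsup-simcse-bert-base-uncased/trainer_state.json"

with open(path) as f:
    log_history = json.load(f)["log_history"]

# Print the dev STS-B Spearman at every evaluation step to see how the run evolved.
for entry in log_history:
    if "eval_stsb_spearman" in entry:
        print(entry["step"], round(entry["eval_stsb_spearman"], 4))
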
gaotianyu1350 commented 2 years ago

The validation result doesn't look right, so something is indeed going wrong during training. Since it's hard for me to debug from here, can you re-clone the GitHub repo, redownload all the data, and train again (just to make sure the code/data are not corrupted)?

ypuzikov commented 2 years ago

Did that, and also asked another person to do it -- same result ¯\_(ツ)_/¯. Can you confirm that a fresh clone gives you something different?

lirenhao1997 commented 2 years ago

I ran into exactly the same question as @ypuzikov, please have a look at your latest repo, thanks!

lirenhao1997 commented 2 years ago

I just deleted the previous contents of the result folder and reran the script run_unsup_example.sh. This model gives me the following results:

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 68.99 | 82.87 | 73.33 | 79.40 | 77.32 |    75.63     |      69.58      | 75.30 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

These seem like acceptable results compared with the ones reported in the paper, though I still don't know the cause of the problem I ran into yesterday...

gaotianyu1350 commented 2 years ago

This looks reasonable to me. @ypuzikov, can you try @SincereAlex's suggestion of deleting everything in the result folder and running it again?

ypuzikov commented 2 years ago

Yeah, will do that in the next couple of days and come back. Although, I have to say, the solution looks more like magic to me -- why would wiping the directory improve the results?

ypuzikov commented 2 years ago

Hey, everyone! It took a bit longer than a couple of days :) Anyway, the hack did not work -- here are my results after wiping out the results folder and re-training the model:

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 49.49 | 55.75 | 58.12 | 66.50 | 60.36 |    57.28     |      56.59      | 57.73 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
gaotianyu1350 commented 2 years ago

Hi, I just cloned the GitHub repo and reran the experiment, and I can still reproduce the result. I'm not sure what went wrong. Could you please try it on some other environments/machines?

TJKlein commented 2 years ago

Unfortunately, I also cannot reproduce the reported results using the current GitHub repo. I ran the script:

sh run_unsup_example.sh

I evaluated on different machines:

python evaluation.py --model_name_or_path result/my-unsup-simcse-bert-base-uncased --pooler cls --task_set sts --mode test

and I get this result:

AWS Deep Learning AMI (Ubuntu 18.04) Version 47.0, V100 (p3.2xlarge), Transformers 4.2.1, PyTorch 1.9.0, CUDA 11.0

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 69.98 | 82.50 | 73.83 | 81.88 | 79.41 |    78.23     |      70.58      | 76.63 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

+-------+-------+-------+-------+-------+-------+-------+-------+
|   MR  |   CR  |  SUBJ |  MPQA |  SST2 |  TREC |  MRPC |  Avg. |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 82.88 | 89.20 | 94.81 | 89.67 | 87.31 | 88.40 | 73.51 | 86.54 |
+-------+-------+-------+-------+-------+-------+-------+-------+

And on another machine:

AWS Deep Learning AMI (Ubuntu 18.04) Version 50.0, V100 (p3.2xlarge), Transformers 4.2.1, PyTorch 1.7.1, CUDA 11.0

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 68.05 | 80.38 | 72.62 | 78.96 | 76.90 |    75.11     |      69.37      | 74.48 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
+-------+-------+-------+-------+-------+-------+-------+-------+
|   MR  |   CR  |  SUBJ |  MPQA |  SST2 |  TREC |  MRPC |  Avg. |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 80.74 | 85.75 | 93.96 | 88.60 | 84.57 | 86.20 | 73.51 | 84.76 |
+-------+-------+-------+-------+-------+-------+-------+-------+

AWS Deep Learning AMI (Ubuntu 18.04) Version 47.0, V100 (p3.2xlarge), Transformers 4.10.0 dev, PyTorch 1.7.1, CUDA 11.0

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 70.06 | 80.64 | 73.70 | 80.86 | 76.93 |    75.60     |      71.20      | 75.57 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
+-------+-------+-------+-------+-------+-------+-------+-------+
|   MR  |   CR  |  SUBJ |  MPQA |  SST2 |  TREC |  MRPC |  Avg. |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 80.74 | 86.68 | 94.27 | 88.97 | 86.16 | 85.40 | 73.51 | 85.10 |
+-------+-------+-------+-------+-------+-------+-------+-------+
gaotianyu1350 commented 2 years ago

Hi,

The first results are even higher than the numbers reported in the paper, right? What is the issue here?