princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License

Cannot reproduce results #38

Closed hankook closed 3 years ago

hankook commented 3 years ago

Hello, and thank you for this useful code! I tried to reproduce the unsupervised BERT+SimCSE results, but failed. My environment setup is as follows:

The following script is the training script I used (exactly the same as run_unsup_example.sh).

python train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/wiki1m_for_simcse.txt \
    --output_dir result/my-unsup-simcse-bert-base-uncased \
    --num_train_epochs 1 \
    --per_device_train_batch_size 64 \
    --learning_rate 3e-5 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --eval_steps 125 \
    --pooler_type cls \
    --mlp_only_train \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --fp16 \
    "$@"

Then I obtained the following evaluation results:

$ python evaluation.py --model_name_or_path result/my-unsup-simcse-bert-base-uncased/ --pooler cls_before_pooler --task_set sts --mode test
(some log ...)
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 65.14 | 79.35 | 70.48 | 80.72 | 76.45 |    74.21     |      70.97      | 73.90 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

I also evaluated the pretrained model (these results are similar to the ones reported in #25):

$ python evaluation.py --model_name_or_path princeton-nlp/unsup-simcse-bert-base-uncased --pooler cls_before_pooler --task_set sts --mode test
(some log ...)
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 68.40 | 82.41 | 74.38 | 80.91 | 78.56 |    76.85     |      72.23      | 76.25 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

I think the gap (2.35 points on average) is too large. Is there something wrong with the training/evaluation scripts above? How can I obtain the ~76 average on the STS tasks?

yaoxingcheng commented 3 years ago

Hi, have you tried python simcse_to_huggingface.py --path result/my-unsup-simcse-bert-base-uncased/ to convert the model's state dict and config before evaluation?

hankook commented 3 years ago

Thank you for the quick response. I just tried running that script before evaluation, but I obtained the same results.

$ python simcse_to_huggingface.py --path result/my-unsup-simcse-bert-base-uncased/
SimCSE checkpoint -> Huggingface checkpoint for result/my-unsup-simcse-bert-base-uncased/
$ python evaluation.py --model_name_or_path result/my-unsup-simcse-bert-base-uncased/ --pooler cls_before_pooler --task_set sts --mode test
(some log ...)
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 65.14 | 79.35 | 70.48 | 80.72 | 76.45 |    74.21     |      70.97      | 73.90 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

yaoxingcheng commented 3 years ago

In that case, I'm not quite sure how to interpret your results. I tried the scripts on Google Colab (pytorch=1.8.1, cuda=10.1, gpu=Tesla K80) and got an average performance of 75.20, similar to what was reproduced in #25. Hopefully this helps. Also, it seems to me that intrinsic differences between GPU devices can affect the performance by up to 1 point, and the optimal hyperparameters are likely to differ across devices. So I'd suggest trying some simple tuning of the batch size, learning rate, and pooling method on your own device, and seeing whether the results improve.
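
For instance, a small grid over learning rate and batch size could be run like the sketch below. This is only an illustrative example that reuses most of the train.py flags from the script above; the grid values and output directory names are placeholders, not recommended settings.

# Illustrative sweep over learning rate and batch size; each run gets its own output directory.
for lr in 1e-5 3e-5 5e-5; do
    for bsz in 64 128; do
        python train.py \
            --model_name_or_path bert-base-uncased \
            --train_file data/wiki1m_for_simcse.txt \
            --output_dir result/sweep-unsup-simcse-lr${lr}-bsz${bsz} \
            --num_train_epochs 1 \
            --per_device_train_batch_size ${bsz} \
            --learning_rate ${lr} \
            --max_seq_length 32 \
            --evaluation_strategy steps \
            --metric_for_best_model stsb_spearman \
            --load_best_model_at_end \
            --eval_steps 125 \
            --pooler_type cls \
            --mlp_only_train \
            --overwrite_output_dir \
            --temp 0.05 \
            --do_train \
            --do_eval
    done
done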

hankook commented 3 years ago

Thank you for the kind response. When I removed the --fp16 option, I obtained a similar result (though the task-specific performance still differs from the pretrained model). Although I know the RTX 2000 series supports mixed precision, I guess there is some fp16-related issue.

+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 69.15 | 82.25 | 74.72 | 81.63 | 78.63 |    78.39     |      69.97      | 76.39 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

gaotianyu1350 commented 3 years ago

Hi,

Thanks for reporting this. I think it is mainly caused by differences between GPU models and CUDA versions. The last reproduced result looks good to me, though (fp16 does introduce a lot of variance).
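
As a side note, when comparing runs across machines it can help to record the exact library and GPU versions. A minimal way to log them, assuming a standard PyTorch/Transformers install (the choice of fields is just a suggestion):

# Print the PyTorch version, the CUDA version it was built with, and the GPU name.
python -c "import torch; print('torch', torch.__version__, '| cuda', torch.version.cuda, '|', torch.cuda.get_device_name(0))"
# Print the Transformers version.
python -c "import transformers; print('transformers', transformers.__version__)"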

hankook commented 3 years ago

I think so. Experiments without fp16 would be better for reproducing the reported results and testing other variants. I'm now closing this issue. Thanks again :)