princeton-nlp / c-sts

[EMNLP 2023] C-STS: Conditional Semantic Textual Similarity

Cannot reproduce the performance shown in the paper #5

Closed BaixuanLi closed 9 months ago

BaixuanLi commented 11 months ago

Sorry to bother you, but I can only get about 0.30 Spearman's and Pearson's correlation, even though I set the hyperparameters according to the paper (simcse-base, cross-encoding, lr=3e-5, wd=0.1, epochs=3, transform=True). The result is well below what the paper reports. Could you help me figure out what's wrong? I'd really appreciate it!

BaixuanLi commented 11 months ago

For reference, the paper reports a correlation of around 0.38 for this model.

carlosejimenez commented 11 months ago

Hi!

Are you using the training scripts provided in this repository? Do you mind pasting the exact command you're using to run/evaluate?

BaixuanLi commented 11 months ago

Sure~

[screenshot: full list of training hyperparameters]

I'm using the provided .sh file with the parameters set according to the paper; the screenshot above shows all the hyperparameters I chose.

carlosejimenez commented 10 months ago

Hey Baixuan, thank you for your patience.

I reran with the following parameters (I believe the same ones you submitted):

model=${MODEL:-princeton-nlp/sup-simcse-roberta-base}  # pre-trained model
encoding=${ENCODER_TYPE:-cross_encoder}  # cross_encoder, bi_encoder, tri_encoder
lr=${LR:-3e-5}  # learning rate
wd=${WD:-0.1}  # weight decay
transform=${TRANSFORM:-True}  # whether to use an additional linear layer after the encoder
objective=${OBJECTIVE:-mse}  # mse, triplet, triplet_mse
triencoder_head=${TRIENCODER_HEAD:-None}  # hadamard, concat (set for tri_encoder)
seed=${SEED:-666}
output_dir=${OUTPUT_DIR:-output}
config=enc_${encoding}__lr_${lr}__wd_${wd}__trans_${transform}__obj_${objective}__tri_${triencoder_head}__s_${seed}
train_file=${TRAIN_FILE:-data/csts_train.csv}
eval_file=${EVAL_FILE:-data/csts_validation.csv}
test_file=${TEST_FILE:-data/csts_test.csv}
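
Since the script reads each setting via ${VAR:-default}, everything can be overridden through environment variables. A minimal sketch of how the three seed runs below could be launched (the loop itself is illustrative, not from the repo):

# Sweep the three seeds used below; run_sts.sh picks up SEED and
# OUTPUT_DIR via the ${VAR:-default} expansions shown above
for seed in 42 101 666; do
  SEED=$seed OUTPUT_DIR=output_s${seed} bash run_sts.sh
done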

I ran 3 seeds on an NVIDIA A6000 and got the following results on TEST:

Seed 42: MSE: 9.172729371293485, spearmanr: 0.3829023341280033, pearsonr: 0.38653424827388805

Seed 101: MSE: 9.109799315896288, spearmanr: 0.39012224501669596, pearsonr: 0.3933606861384604

Seed 666: MSE: 9.248459630892224, spearmanr: 0.36515925620807504, pearsonr: 0.367544032549672

Mean: MSE: 9.176996106027334, spearmanr: 0.3793946117842581, pearsonr: 0.3824796556540068

I also tried on an NVIDIA RTX 2080 and got the following:

Seed 42: MSE: 9.14469821586109, spearmanr: 0.3940268436387526, pearsonr: 0.39658619051163446

Seed 101: MSE: 9.145529172505276, spearmanr: 0.383653880273925, pearsonr: 0.387261727044946

Seed 666: MSE: 9.223952018486575, spearmanr: 0.39191138026960576, pearsonr: 0.3953807255869861

Mean: MSE: 9.171393135617647, spearmanr: 0.3898640347274278, pearsonr: 0.39307621438118884
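
If you want to sanity-check these averages, a quick awk sketch (values copied from the RTX 2080 runs above; the same pattern works for the other metrics):

# Average the three per-seed MSE values reported above
printf '%s\n' 9.14469821586109 9.145529172505276 9.223952018486575 \
  | awk '{ s += $1 } END { printf "mean MSE: %.6f\n", s / NR }'  # 9.171393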

So these all seem pretty close to the reported number, though there is a bit of hardware-dependent variance. Are you running on a very different GPU?

Btw, note that there are some more parameters in run_sts.sh that could impact performance. For instance, I ran with the following (included by default in run_sts.sh):

  --pooler_type cls \
  --freeze_encoder False \
  --max_seq_length 512 \
  --condition_only False \
  --sentences_only False \
  --do_train \
  --do_eval \
  --do_predict \
  --evaluation_strategy epoch \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 4 \
  --max_grad_norm 0.0 \
  --num_train_epochs 3 \
  --lr_scheduler_type linear \
  --warmup_ratio 0.1 \
  --save_strategy epoch \
  --save_total_limit 1 \
  --fp16 True

So if you've changed any of those parameters, it could also impact performance.
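
One interaction that's easy to overlook: per_device_train_batch_size and gradient_accumulation_steps multiply into the effective batch size, so changing either one alone changes the optimization. A quick check under the defaults above:

# Effective (per-GPU) batch size implied by the defaults above
echo $(( 8 * 4 ))  # per_device_train_batch_size * gradient_accumulation_steps = 32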