Hello! I am trying to reproduce the BLEU score reported in your CodeReviewer paper. The fine-tuning script you provided for the "review comment generation" downstream task sets the number of training steps to 60,000. When evaluating the fine-tuned model on your test dataset, I got a BLEU score of 5.16, while the paper reports 5.32.
I used the exact shell scripts uploaded in this CodeReviewer GitHub repository. My question is: what exact training-step count or checkpoint did you use to produce the score (5.32) reported in the paper? What might be the possible reasons for this mismatch?
Since the training process is not deterministic, a mismatch of 0.16 is acceptable. You can try evaluating other checkpoints; one of them may yield a closer BLEU score.
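If you want to sanity-check scores across several checkpoints yourself, here is a minimal smoothed corpus-BLEU sketch using only the Python standard library. This is an assumption-laden illustration, not the repository's own evaluation script: CodeReviewer's reported numbers come from its own tokenization and smoothing choices, so absolute values from this sketch may differ slightly, but relative comparisons between checkpoints should still be informative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Smoothed corpus-level BLEU (0-100) over parallel lists of token lists.

    Uses add-one smoothing for n > 1 precisions (one common smoothed-BLEU
    variant; the paper's exact smoothing may differ).
    """
    precisions = []
    for n in range(1, max_n + 1):
        match, total = 0, 0
        for hyp, ref in zip(hypotheses, references):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            match += sum(min(c, r[g]) for g, c in h.items())
            total += max(len(hyp) - n + 1, 0)
        if total == 0:
            precisions.append(0.0)
        else:
            smooth = 1 if n > 1 else 0  # add-one smoothing above unigrams
            precisions.append((match + smooth) / (total + smooth))
    if min(precisions) == 0.0:
        return 0.0
    # Brevity penalty: penalize hypotheses shorter than the references.
    hyp_len = sum(len(h) for h in hypotheses)
    ref_len = sum(len(r) for r in references)
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n) * 100

# Hypothetical usage: compare checkpoint outputs against gold comments.
# `preds` and `golds` would come from whitespace-tokenizing the generation
# and reference files produced for each checkpoint.
preds = [["fix", "the", "null", "check", "here"]]
golds = [["fix", "the", "null", "check", "please"]]
print(f"BLEU: {corpus_bleu(preds, golds):.2f}")
```

Running this for each saved checkpoint's predictions and picking the highest-scoring one is one way to follow the suggestion above; differences on the order of 0.1-0.2 BLEU between checkpoints are common.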