Hi,
I've recently been working with BART fine-tuned on CNN/DM. I downloaded the fine-tuned `bart.large.cnn` checkpoint from fairseq and evaluated it using the code in the repository, but I got ROUGE scores of 42.31 / 20.21 / 39.31 instead of the 44.16 / 21.28 / 40.90 reported in the README. Since I used the downloaded model and the evaluation code from the repository, I am confused about why the results differ this much. Here is essentially what I ran (sketch below).
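For reference, this is a minimal sketch of my evaluation, following the generation settings given in the fairseq summarization README; `test.source` and `test.hypo` are placeholder paths for the CNN/DM test split and the generated summaries:

```python
import torch

# Load the downloaded fine-tuned checkpoint via torch.hub,
# as shown in the fairseq BART README.
bart = torch.hub.load('pytorch/fairseq', 'bart.large.cnn')
bart.cuda()
bart.eval()
bart.half()  # optional, for faster inference

batch_size = 32  # the README processes the test set in batches

with open('test.source') as src, open('test.hypo', 'w') as out:
    # One article per line (placeholder path for the CNN/DM test sources).
    articles = [line.strip() for line in src]
    for i in range(0, len(articles), batch_size):
        with torch.no_grad():
            # Beam-search hyperparameters taken from the fairseq
            # summarization README.
            hypos = bart.sample(
                articles[i:i + batch_size],
                beam=4, lenpen=2.0, max_len_b=140,
                min_len=55, no_repeat_ngram_size=3,
            )
        for hypo in hypos:
            out.write(hypo + '\n')
```

I then tokenized `test.hypo` and the reference summaries and scored them with files2rouge, as the README describes.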
I also noticed that your paper reports a baseline result for `bart.large.cnn` with ROUGE scores of 43.98 / 21.07 / 40.82. May I ask how you ran the model to obtain this result? Is there anything I am missing when reproducing it? Thanks!