maszhongming / MatchSum

Code for ACL 2020 paper: "Extractive Summarization as Text Matching"
520 stars 108 forks

Not getting the same rouge score #19

Open annismith2016 opened 4 years ago

annismith2016 commented 4 years ago

Hello

For the Multi-News dataset, I ran your ROUGE evaluation code on the generated output you provided, but I am not getting the same results.

Here are your reported results:

---------------------------------------------
1 ROUGE-1 Average_R: 0.49376 (95%-conf.int. 0.49086 - 0.49702)
1 ROUGE-1 Average_P: 0.46147 (95%-conf.int. 0.45854 - 0.46440)
1 ROUGE-1 Average_F: 0.46223 (95%-conf.int. 0.45993 - 0.46440)
---------------------------------------------
1 ROUGE-2 Average_R: 0.17810 (95%-conf.int. 0.17541 - 0.18108)
1 ROUGE-2 Average_P: 0.16330 (95%-conf.int. 0.16084 - 0.16581)
1 ROUGE-2 Average_F: 0.16502 (95%-conf.int. 0.16262 - 0.16764)
---------------------------------------------
1 ROUGE-L Average_R: 0.44680 (95%-conf.int. 0.44412 - 0.44996)
1 ROUGE-L Average_P: 0.41897 (95%-conf.int. 0.41602 - 0.42185)
1 ROUGE-L Average_F: 0.41903 (95%-conf.int. 0.41680 - 0.42119)

And here is what I got:

---------------------------------------------
1 ROUGE-1 Average_R: 0.48863 (95%-conf.int. 0.48578 - 0.49186)
1 ROUGE-1 Average_P: 0.45658 (95%-conf.int. 0.45363 - 0.45953)
1 ROUGE-1 Average_F: 0.45736 (95%-conf.int. 0.45512 - 0.45955)
---------------------------------------------
1 ROUGE-2 Average_R: 0.17679 (95%-conf.int. 0.17413 - 0.17974)
1 ROUGE-2 Average_P: 0.16207 (95%-conf.int. 0.15965 - 0.16456)
1 ROUGE-2 Average_F: 0.16378 (95%-conf.int. 0.16145 - 0.16635)
---------------------------------------------
1 ROUGE-L Average_R: 0.44234 (95%-conf.int. 0.43963 - 0.44541)
1 ROUGE-L Average_P: 0.41468 (95%-conf.int. 0.41180 - 0.41758)
1 ROUGE-L Average_F: 0.41479 (95%-conf.int. 0.41250 - 0.41693)

No offense, but do you know what causes this difference?

maszhongming commented 3 years ago

I re-ran the ROUGE evaluation and still get the same results as I reported, so I'm not sure what caused this discrepancy. If you also can't reproduce the results on the other datasets, there may be a problem with your pyrouge installation. Does it pass the official pyrouge test? You can also compare the scores you get on RELEASE-1.5.5/sample-output with output that others have posted online. Are the ROUGE scores the same?
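
To narrow down where two runs diverge, it can help to compare the reports line by line and check whether the 95% confidence intervals overlap. The following is a minimal sketch, not part of the MatchSum code: the helper names `parse_rouge` and `compare` are hypothetical, the regular expression assumes the ROUGE-1.5.5 summary-line layout shown in this thread, and the sample data is just the Average_F lines copied from the two reports above.

```python
import re

# Matches one summary line in the layout shown in this thread, e.g.
# "1 ROUGE-1 Average_F: 0.46223 (95%-conf.int. 0.45993 - 0.46440)"
LINE_RE = re.compile(
    r"(ROUGE-\S+) Average_([RPF]): (\d+\.\d+) "
    r"\(95%-conf\.int\. (\d+\.\d+) - (\d+\.\d+)\)"
)


def parse_rouge(report):
    """Return {(metric, kind): (score, ci_low, ci_high)} for each summary line."""
    scores = {}
    for match in LINE_RE.finditer(report):
        metric, kind, score, low, high = match.groups()
        scores[(metric, kind)] = (float(score), float(low), float(high))
    return scores


def compare(reported, obtained):
    """Return {(metric, kind): (gap, intervals_overlap)} for shared entries."""
    result = {}
    for key, (rep, rep_lo, rep_hi) in reported.items():
        if key not in obtained:
            continue
        obt, obt_lo, obt_hi = obtained[key]
        gap = round(rep - obt, 5)
        # Two closed intervals overlap when each starts before the other ends.
        overlap = rep_lo <= obt_hi and obt_lo <= rep_hi
        result[key] = (gap, overlap)
    return result


# Average_F lines copied from the two reports in this thread.
REPORTED = """
1 ROUGE-1 Average_F: 0.46223 (95%-conf.int. 0.45993 - 0.46440)
1 ROUGE-2 Average_F: 0.16502 (95%-conf.int. 0.16262 - 0.16764)
1 ROUGE-L Average_F: 0.41903 (95%-conf.int. 0.41680 - 0.42119)
"""
OBTAINED = """
1 ROUGE-1 Average_F: 0.45736 (95%-conf.int. 0.45512 - 0.45955)
1 ROUGE-2 Average_F: 0.16378 (95%-conf.int. 0.16145 - 0.16635)
1 ROUGE-L Average_F: 0.41479 (95%-conf.int. 0.41250 - 0.41693)
"""

if __name__ == "__main__":
    diffs = compare(parse_rouge(REPORTED), parse_rouge(OBTAINED))
    for (metric, kind), (gap, overlap) in sorted(diffs.items()):
        print(f"{metric} {kind}: gap={gap:.5f} intervals_overlap={overlap}")
```

On the F-scores above, the ROUGE-1 intervals do not overlap, while the ROUGE-2 and ROUGE-L intervals do. A gap of this kind across all three metrics is more consistent with a difference in the evaluation setup than with noise in a single metric, though the sketch alone cannot say which setting differs.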