microsoft / BANG

BANG is a new pretraining model to Bridge the gap between Autoregressive (AR) and Non-autoregressive (NAR) Generation. AR and NAR generation can be viewed uniformly in terms of how many previous tokens can be attended to, and BANG bridges the two by designing a novel model structure for large-scale pretraining. The pretrained BANG model can simultaneously support AR, NAR, and semi-NAR generation to meet different requirements.

How to get the ROUGE scores in the paper? #1

Open pvop opened 2 years ago

pvop commented 2 years ago

BANG is a great paper, but I have some problems trying to reproduce the scores reported in it. First, the BLEU-4 and ROUGE-L I get on SQuAD question generation with the MASS pretrained model differ from the paper. With beam size 5 I get BLEU-4 = 22.43 and the following ROUGE scores:

1 ROUGE-1 Average_R: 0.48431 (95%-conf.int. 0.47986 - 0.48875)
1 ROUGE-1 Average_P: 0.54315 (95%-conf.int. 0.53853 - 0.54740)
1 ROUGE-1 Average_F: 0.49817 (95%-conf.int. 0.49411 - 0.50238)

1 ROUGE-2 Average_R: 0.26775 (95%-conf.int. 0.26311 - 0.27234)
1 ROUGE-2 Average_P: 0.29883 (95%-conf.int. 0.29365 - 0.30367)
1 ROUGE-2 Average_F: 0.27436 (95%-conf.int. 0.26965 - 0.27884)

1 ROUGE-L Average_R: 0.44690 (95%-conf.int. 0.44248 - 0.45166)
1 ROUGE-L Average_P: 0.49998 (95%-conf.int. 0.49532 - 0.50436)
1 ROUGE-L Average_F: 0.45929 (95%-conf.int. 0.45507 - 0.46371)

So BLEU-4 is higher than in the paper, but ROUGE-L is lower.
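For reference, below is a minimal sketch of how corpus-level scores like these can be computed. It uses the `nltk` and `rouge-score` Python packages, which are an assumption here and not necessarily the paper's GLGE / ROUGE-1.5.5 scripts, so small differences against the paper's numbers are expected from tokenization and stemming alone.

    # Minimal sketch: corpus-level BLEU-4 and averaged ROUGE F1 over prediction/reference
    # files with one sentence per line. Uses `nltk` and `rouge-score`, which may score
    # slightly differently from the ROUGE-1.5.5 / GLGE scripts used in the paper.
    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
    from rouge_score import rouge_scorer

    def evaluate(pred_file, ref_file):
        preds = [line.strip() for line in open(pred_file, encoding="utf-8")]
        refs = [line.strip() for line in open(ref_file, encoding="utf-8")]

        # Corpus-level BLEU-4 on whitespace-tokenized text
        bleu4 = corpus_bleu(
            [[r.split()] for r in refs],
            [p.split() for p in preds],
            weights=(0.25, 0.25, 0.25, 0.25),
            smoothing_function=SmoothingFunction().method3,
        )

        # Sentence-level ROUGE F1, averaged over the corpus
        scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
        totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
        for ref, pred in zip(refs, preds):
            scores = scorer.score(ref, pred)
            for key in totals:
                totals[key] += scores[key].fmeasure
        averages = {key: value / len(preds) for key, value in totals.items()}

        print(f"BLEU-4: {bleu4:.4f}")
        for key, value in averages.items():
            print(f"{key}: {value:.4f}")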

Second, I used the BANG pretrained model and the code in this repo, but my ROUGE scores lag behind the scores in the paper by a gap of about 2, and the generation quality is poor. A sample of the output:

when dominicn choe was as to , singapore , he to . northern ireland ' s euro 2016 qualifier ireland was after a crash . prime may says she has faith ' in ' trident nuclear after afire a the bbc . tennis police is investigating by a williams a at . a coast has been that to oil coast a public bonfire park bonfire belfast has been up a of bonfire the bbc has strongly a claims thatguana iling iguana was ' .

qiweizhen commented 2 years ago

Hi, thank you for your interest! First, the MASS scores are cited from the generation benchmark GLGE: the MASS-base results on the SQuAD question generation (easy) dataset. The evaluation scripts we use are from this link. Note that those scripts are not complete; the remaining files can be found in the original SQuAD benchmark.

Second, we will upload a new version of the BANG v2 fine-tuning scripts, together with the detailed processed data files. It will then be much easier to reproduce the results, with clear improvements over this version.