Open pvop opened 2 years ago
Hi, thank you for your interest! First, the MASS scores are cited from the MASS-base results on the SQuAD question generation (easy) dataset in the GLGE generation benchmark. The evaluation scripts we used are from this link. Note that those scripts are not complete; the remaining files can be found in the original SQuAD benchmark.
Second, we will upload a new version of the BANG v2 fine-tuning scripts, with detailed processed data files and scripts. That will make reproduction much easier, with clear improvements over this version.
BANG is a great paper, but I ran into some problems trying to reproduce the scores reported in it. First, the BLEU-4 and ROUGE-L I get on SQuAD question generation with the MASS pretrained model differ from the paper. With beam size 5 I get BLEU-4 = 22.43 and the following ROUGE scores:
1 ROUGE-1 Average_R: 0.48431 (95%-conf.int. 0.47986 - 0.48875)
1 ROUGE-1 Average_P: 0.54315 (95%-conf.int. 0.53853 - 0.54740)
1 ROUGE-1 Average_F: 0.49817 (95%-conf.int. 0.49411 - 0.50238)
1 ROUGE-2 Average_R: 0.26775 (95%-conf.int. 0.26311 - 0.27234)
1 ROUGE-2 Average_P: 0.29883 (95%-conf.int. 0.29365 - 0.30367)
1 ROUGE-2 Average_F: 0.27436 (95%-conf.int. 0.26965 - 0.27884)
1 ROUGE-L Average_R: 0.44690 (95%-conf.int. 0.44248 - 0.45166)
1 ROUGE-L Average_P: 0.49998 (95%-conf.int. 0.49532 - 0.50436)
1 ROUGE-L Average_F: 0.45929 (95%-conf.int. 0.45507 - 0.46371)

So my BLEU-4 is higher than the paper's, but my ROUGE-L is lower.
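For reference, here is a minimal sketch of how corpus-level BLEU-4 and LCS-based ROUGE-L F1 are typically computed for question generation outputs. This is not the exact GLGE / ROUGE-1.5.5 evaluation pipeline referenced above (that toolkit has its own tokenization, stemming options, and bootstrap confidence intervals); it only illustrates the two metrics being compared, so small numeric differences versus the official scripts are expected.

```python
# Illustrative (not official) BLEU-4 and ROUGE-L implementations.
from collections import Counter
import math

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu4(hyps, refs):
    # hyps/refs: lists of token lists, one reference per hypothesis.
    log_prec = 0.0
    for n in range(1, 5):
        match = total = 0
        for h, r in zip(hyps, refs):
            hc, rc = ngrams(h, n), ngrams(r, n)
            match += sum(min(c, rc[g]) for g, c in hc.items())  # clipped counts
            total += max(len(h) - n + 1, 0)
        if match == 0:
            return 0.0
        log_prec += math.log(match / total) / 4  # uniform 1/4 weights
    hyp_len = sum(len(h) for h in hyps)
    ref_len = sum(len(r) for r in refs)
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(log_prec)

def rouge_l_f1(hyp, ref):
    # LCS-based ROUGE-L F1 for one hypothesis/reference pair.
    m, n = len(hyp), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if hyp[i] == ref[j] \
                else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    p, r = lcs / m, lcs / n
    return 2 * p * r / (p + r)

hyp = "what year did the war end".split()
ref = "in what year did the war end".split()
print(round(corpus_bleu4([hyp], [ref]), 4))  # → 0.8465 (brevity penalty only)
print(round(rouge_l_f1(hyp, ref), 4))        # → 0.9231
```

Since BLEU-4 is precision-oriented and ROUGE-L is recall-sensitive, short beam-search outputs can push BLEU-4 up while pulling ROUGE-L down, which may partly explain the mixed result above.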
Second, I use the BANG pretrained model and the code in this repo, but my ROUGE scores lag behind those in the paper by a gap of about 2, and the generation quality is poor. For example, the model outputs:

when dominicn choe was as to , singapore , he to . northern ireland ' s euro 2016 qualifier ireland was after a crash . prime may says she has faith ' in ' trident nuclear after afire a the bbc . tennis police is investigating by a williams a at . a coast has been that to oil coast a public bonfire park bonfire belfast has been up a of bonfire the bbc has strongly a claims thatguana iling iguana was ' .