microsoft / ProphetNet

A research project for natural language generation, containing the official implementations by MSRA NLC team.

The reported scores of GENIE are not fair #57

Open BaohaoLiao opened 1 year ago

BaohaoLiao commented 1 year ago

Hi @qiweizhen,

I have a question about your evaluation.

From your paper: "In the inference process, we randomly sample 10 Gaussian noises for iteration denoising, and use the highest score as the final generated result." I also checked your file https://github.com/microsoft/ProphetNet/blob/master/GENIE/integration/eval_split.py.

For each source sentence, you generate 10 hypotheses. Then you compute the ROUGE score between each hypothesis and the target sentence, and take the hypothesis with the best score as the final generation. You do this for every source sentence and combine all of the best-scoring hypotheses into the final generation file.
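In code, that selection step looks roughly like the following. This is only a minimal sketch of the "oracle best-of-10" procedure as I understand it, using the `rouge_score` package, not the actual eval_split.py; note that the target (reference) is consulted when choosing which hypothesis to keep.

```python
# Minimal sketch of oracle best-of-N selection by ROUGE (not eval_split.py itself).
# The reference/target is used to pick the hypothesis, which is the leak in question.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def pick_best_hypothesis(hypotheses, target):
    """Return the hypothesis with the highest ROUGE-L F1 against the target."""
    return max(hypotheses, key=lambda hyp: scorer.score(target, hyp)["rougeL"].fmeasure)

# For each source sentence: sample 10 hypotheses, keep the oracle-best one,
# then report corpus-level ROUGE over these selected hypotheses.
```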

My question is: is this a fair or standard way to evaluate generation? At inference time, the target sentences should be unseen; we can't use them as a hint for generation.

lzh0525 commented 1 year ago

Thank you for your interest in our work.

The results in the main table are not entirely fair to compare, which is also mentioned in Chapter 4.5. Strictly speaking, there is currently no fully fair and rigorous method to compare AR and diffusion models. However, these experiments can reflect the potential of diffusion models to generate results comparable to AR models, and they also reflect general trends.

In fact, we recognize this problem and propose a fairer evaluation method in this article, using an LLM to evaluate 10 samples generated by the AR model and 10 samples generated by GENIE. From the results shown in Table 4 and Table 5, it can be seen that the overall quality of the diffusion model is slightly lower than that of the AR model, but the diffusion model can generate more diverse samples, which is also very important in practical applications of text generation.
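For illustration only, a reference-free comparison of this kind could be sketched as below. The `llm_rate` judge and the distinct-2 diversity proxy are assumptions for illustration, not the exact setup used in the paper's Tables 4 and 5.

```python
# Hypothetical sketch: rate each sample with an LLM judge (placeholder function)
# for quality, and use distinct-2 as a simple diversity proxy.
from itertools import chain

def llm_rate(text: str) -> float:
    """Placeholder: return an LLM-assigned quality score in [0, 1]."""
    raise NotImplementedError("plug in your LLM judge here")

def distinct_n(samples, n=2):
    """Fraction of unique n-grams across all samples (higher = more diverse)."""
    ngrams = list(chain.from_iterable(
        zip(*(s.split()[i:] for i in range(n))) for s in samples
    ))
    return len(set(ngrams)) / max(len(ngrams), 1)

def compare(ar_samples, genie_samples):
    systems = {"AR": ar_samples, "GENIE": genie_samples}
    quality = {k: sum(map(llm_rate, v)) / len(v) for k, v in systems.items()}
    diversity = {k: distinct_n(v) for k, v in systems.items()}
    return quality, diversity
```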