Closed jincan333 closed 1 year ago
Hey @jincan333! Thank you for your interest in our work. In the paper, we show that this evaluation criterion correlates with human evaluations. There are other metrics such as BERTscore that might work well too, though!
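To make the criterion concrete, here is a minimal sketch of the ROUGE-L > 0.3 accuracy check discussed in this thread. ROUGE-L is the F-measure of the longest common subsequence (LCS) between the generated answer and the reference answer; this toy version assumes simple whitespace tokenization, and the function names are illustrative. In practice you would likely use a library such as `rouge-score` instead.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a, 1):
        for j, tb in enumerate(b, 1):
            # Extend the LCS on a match, otherwise carry the best so far.
            dp[i][j] = dp[i - 1][j - 1] + 1 if ta == tb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 with naive lowercase whitespace tokenization."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def is_correct(candidate, reference, threshold=0.3):
    """Accuracy criterion from the thread: count as correct if ROUGE-L > 0.3."""
    return rouge_l_f1(candidate, reference) > threshold

print(is_correct("paris is the capital of france",
                 "the capital of france is paris"))  # → True
```

Here the LCS is "the capital of france" (4 tokens out of 6 on each side), giving an F1 of about 0.67, well above the 0.3 cutoff.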
Thank you! I also found the data in the paper.
Hi Lorenz:
I'm curious why you use ROUGE-L > 0.3 as the criterion to judge the generative model's accuracy. Is ROUGE-L a common way to evaluate free-form QA tasks? Is there a better metric for evaluating NLG models on free-form QA? And why choose 0.3 as the threshold?