lorenzkuhn / semantic_uncertainty

MIT License
136 stars 19 forks

Why using rouge-L > 0.3 as accuracy metric? #4

Closed jincan333 closed 1 year ago

jincan333 commented 1 year ago

Hi Lorenz:

I'm curious about why you use ROUGE-L > 0.3 as the criterion for judging the generative model's accuracy. Is ROUGE-L a common way to evaluate free-form QA tasks? Is there a better metric for NLG models on free-form QA? And why choose 0.3 as the threshold?

lorenzkuhn commented 1 year ago

Hey @jincan333! Thank you for your interest in our work. In the paper, we show that this evaluation criterion correlates with human evaluations. There are other metrics such as BERTscore that might work well too, though!
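For anyone landing here later, the criterion under discussion can be sketched as: compute the ROUGE-L F-measure between the generated answer and the reference, and count the answer as correct when it exceeds 0.3. Below is a minimal pure-Python sketch of that check (LCS-based ROUGE-L, no stemming or tokenization tricks, so scores may differ slightly from library implementations such as `rouge_score`):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f(reference, candidate):
    """ROUGE-L F-measure on whitespace-tokenized, lowercased strings."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def is_correct(reference, candidate, threshold=0.3):
    """Accuracy criterion from the thread: ROUGE-L F-measure > threshold."""
    return rouge_l_f(reference, candidate) > threshold
```

For example, `rouge_l_f("paris", "the capital is paris")` is 0.4, so the answer counts as correct at the 0.3 threshold, while an answer sharing no tokens with the reference scores 0.0.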

jincan333 commented 1 year ago

Thank you! I also found the data in the paper~