sebastianruder / NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
https://nlpprogress.com/
MIT License

Summarization metrics #104

Closed ellej16 closed 6 years ago

ellej16 commented 6 years ago

Hi guys! I've observed that research featured for summarization mostly describes evaluating summaries using only the following metrics:

And recalling past research, I see these are the metrics most often used.

Does anyone have an idea why these are favored over other metrics? Specifically:

Among others? (I mentioned RR, Retention Ratio, because I used it previously along with CR, Compression Ratio.)

Thanks for this great repo, btw!

ellej16 commented 6 years ago

Tagging summarization.md contributors, sorry for the bother! @jfsantos @shashiongithub @FredRodrigues @sebastianruder

sebastianruder commented 6 years ago

Hey @ellej16, could you be more explicit about how Retention Ratio is used as a metric? In Hassel (2004), it is only defined as information in summary / information in full text.
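
For concreteness, here is a minimal sketch of that ratio, assuming (crudely) that content-word overlap stands in for "information"; the tokenizer and stopword list below are illustrative stand-ins, not anything from Hassel (2004):

```python
import re

# Tiny illustrative stopword list; a real implementation would use a proper one.
STOPWORDS = frozenset({"the", "a", "an", "of", "in", "to", "and", "is", "are"})

def content_words(text):
    """Crude proxy for 'information': lowercased word tokens minus stopwords."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}

def retention_ratio(summary, full_text):
    """Hassel (2004)-style ratio: information in summary / information in full text,
    approximated here as the fraction of the full text's unique content words
    that survive into the summary."""
    full = content_words(full_text)
    if not full:
        return 0.0
    return len(full & content_words(summary)) / len(full)
```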

ellej16 commented 6 years ago

Hi @sebastianruder! In past research we used Answer Recall Average [Mani 2002] to measure how much of that information is in the summary, by answering certain questions based on the full text. A respondent is tasked beforehand with creating those Q&A pairs from the full text (the "information in full text" side of the ratio).
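
If I'm reading that right, the scoring step could be scripted roughly like this; the Q&A pairs come from a human annotator, and the `answered_from` matching rule below is a hypothetical placeholder, not the actual criterion from Mani (2002):

```python
def answered_from(summary, reference_answer):
    """Hypothetical matching rule: the answer counts as recalled if all of
    its words appear somewhere in the summary."""
    summary_words = set(summary.lower().split())
    return all(w in summary_words for w in reference_answer.lower().split())

def answer_recall_average(summary, qa_pairs):
    """Fraction of human-created (question, answer) pairs, written from the
    full text, whose answers can still be recovered from the summary."""
    if not qa_pairs:
        return 0.0
    hits = sum(answered_from(summary, answer) for _question, answer in qa_pairs)
    return hits / len(qa_pairs)

# e.g. answer_recall_average(summary, [("Who resigned?", "the minister")])
```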

sebastianruder commented 6 years ago

Cool. Given your description, it seems that Answer Recall Average is a lot more expensive to evaluate, particularly at large scale, as you require human-written questions and answers for every text. I think that's similar to human evaluation vs. BLEU in Machine Translation, and it's arguably the main reason why automatic metrics are preferred.
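
To make the cost contrast concrete: an automatic metric such as ROUGE-1 recall needs only a reference summary and no human at evaluation time. A minimal sketch (real evaluations use the official ROUGE toolkit, with stemming and often multiple references):

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: fraction of reference unigrams (counted with
    multiplicity, clipped against the candidate) found in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(n, cand[tok]) for tok, n in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```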

ellej16 commented 6 years ago

Thank you very much for your time and insight on this one!

Also, huge thanks for this repository (glad to see summarization still drawing research interest!)