lil-lab / newsroom

Tools for downloading and analyzing summaries and evaluating summarization systems. https://summari.es/

Possible errors in ROUGE-L evaluation #28

Closed: grusky closed this issue 1 year ago

grusky commented 1 year ago

ROUGE-L is sensitive to sentence tokenization.

The data format used by the newsroom-run -> newsroom-score -> newsroom-tables evaluation pipeline does not appear to keep track of sentence tokenization. When sentence tokenization is not provided to ROUGE-1.5.5, multi-sentence references and hypotheses are evaluated as one long sentence. As a result, ROUGE-L scores produced by this evaluation pipeline (1) may be lower than expected compared to more standard ROUGE evaluation that uses tokenized sentences, and (2) probably do not match the ROUGE-L scores in the Newsroom paper, which are computed using sentence tokenization.
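To illustrate the effect outside of ROUGE-1.5.5, here is a minimal sketch using the open-source `rouge_score` Python package (not part of the newsroom pipeline): its `rougeL` metric treats each text as one long sequence, while `rougeLsum` computes summary-level ROUGE-L over newline-separated sentences. The texts below are purely illustrative.

```python
# Minimal sketch of how sentence boundaries change ROUGE-L,
# using the `rouge_score` package as a stand-in for ROUGE-1.5.5.
from rouge_score import rouge_scorer

# Two-sentence reference and hypothesis; newlines mark sentence boundaries
# (the convention `rougeLsum` uses in this package).
reference = "The cat sat on the mat.\nIt was a sunny day."
hypothesis = "It was a sunny day.\nThe cat sat on the mat."

scorer = rouge_scorer.RougeScorer(["rougeL", "rougeLsum"], use_stemmer=True)
scores = scorer.score(reference, hypothesis)

# rougeL: single-sequence LCS, i.e. the multi-sentence texts are scored
# as one long sentence, so sentence reordering shortens the LCS.
print("rougeL   ", scores["rougeL"].fmeasure)

# rougeLsum: union-LCS computed per sentence, so the score for a
# multi-sentence summary is typically higher than plain rougeL.
print("rougeLsum", scores["rougeLsum"].fmeasure)
```

Running this shows `rougeLsum` exceeding `rougeL` on the same pair, which is the same direction of discrepancy described above: dropping sentence tokenization tends to lower ROUGE-L.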

I would recommend adding a notice to the README that the evaluation pipeline does not keep track of sentence tokenization, which may result in lower-than-expected ROUGE-L scores, and that the newsroom-run -> newsroom-score -> newsroom-tables evaluation pipeline should not be used for publishable evaluation.

yoavartzi commented 1 year ago

Updated the README at the root and added a note in the evaluation directory. Thanks!