huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

ROUGE_L score of summarization/t5 is much lower than the paper's. #4860

Closed takahiro971 closed 4 years ago

takahiro971 commented 4 years ago

🐛 Bug

Information

I tried summarization/t5 from the examples. ROUGE_1 and ROUGE_2 are close to the numbers in Google's paper, but ROUGE_L alone is much lower!

ROUGE_1: paper=41.12 | my result=40.48 (almost equal)
ROUGE_2: paper=19.56 | my result=18.59 (almost equal)
ROUGE_L: paper=38.35 | my result=28.22 (much lower!)

Model I am using (Bert, XLNet ...): T5

Language I am using the model on (English, Chinese ...): English

takahiro971 commented 4 years ago

So I investigated Google's code and found this:

https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py#L76

I think they use "rougeLsum", not "rougeL". They also note: "# Add newlines between sentences so that rougeLsum is computed correctly."

https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py#L82
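For context, here is my own simplified illustration (not the actual rouge_score implementation): "rougeL" runs a single LCS over the whole summary, while "rougeLsum" works sentence by sentence after splitting on newlines, which is why the missing newlines hurt the score so much. The `rouge_lsum_sketch` helper below is hypothetical and greedy; the real metric uses a union-LCS, but the idea is the same.

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]


def f1(lcs, n_ref, n_cand):
    if lcs == 0:
        return 0.0
    p, r = lcs / n_cand, lcs / n_ref
    return 2 * p * r / (p + r)


def rouge_l(reference, candidate):
    # "rougeL": one LCS over the whole summaries as flat token sequences.
    ref, cand = reference.split(), candidate.split()
    return f1(lcs_len(ref, cand), len(ref), len(cand))


def rouge_lsum_sketch(reference, candidate):
    # Simplified "rougeLsum": split on newlines, take the best
    # per-sentence LCS for each reference sentence.
    ref_sents = [s.split() for s in reference.split("\n")]
    cand_sents = [s.split() for s in candidate.split("\n")]
    total = sum(max(lcs_len(r, c) for c in cand_sents) for r in ref_sents)
    n_ref = sum(len(s) for s in ref_sents)
    n_cand = sum(len(s) for s in cand_sents)
    return f1(total, n_ref, n_cand)


ref = "the cat sat .\nthe dog ran ."
cand = "the dog ran .\nthe cat sat ."
print(rouge_l(ref.replace("\n", " "), cand.replace("\n", " ")))  # 0.5
print(rouge_lsum_sketch(ref, cand))  # 1.0
```

With the sentence order swapped, the whole-sequence LCS drops while the sentence-level score stays high. That matches the gap I see between my rougeL and rougeLsum numbers.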

So I tried the following hack:

$ g log -p
commit 11bd4a086438b100c47e5e2b7e8696fcd67e94d1
Author: Takahiro Ito <65151988+takahiro971@users.noreply.github.com>
Date:   Tue Jun 9 14:35:19 2020 +0900

    Fix a bug in score calculation

diff --git a/examples/summarization/t5/evaluate_cnn.py b/examples/summarization/t5/evaluate_cnn.py
index d2d6ee9..e1db944 100644
--- a/examples/summarization/t5/evaluate_cnn.py
+++ b/examples/summarization/t5/evaluate_cnn.py
@@ -44,17 +44,27 @@ def generate_summaries(lns, output_file_path, model_size, batch_size, device):

 def calculate_rouge(output_lns, reference_lns, score_path):
     score_file = Path(score_path).open("w")
-    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
+    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL", "rougeLsum"], use_stemmer=True)
     aggregator = scoring.BootstrapAggregator()

+    # copy from
+    # https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py#L80
+    def _prepare_summary(summary):
+        # Make sure the summary is not bytes-type
+        # Add newlines between sentences so that rougeLsum is computed correctly.
+        summary = summary.replace(" . ", " .\n")
+        return summary
+
     for reference_ln, output_ln in zip(reference_lns, output_lns):
+        reference_ln = _prepare_summary(reference_ln)
+        output_ln = _prepare_summary(output_ln)
         scores = scorer.score(reference_ln, output_ln)
         aggregator.add_scores(scores)

     result = aggregator.aggregate()
     score_file.write(
-        "ROUGE_1: \n{} \n\n ROUGE_2: \n{} \n\n ROUGE_L: \n{} \n\n".format(
-            result["rouge1"], result["rouge2"], result["rougeL"]
+        "ROUGE_1: \n{} \n\n ROUGE_2: \n{} \n\n ROUGE_L: \n{} \n\n ROUGE_Lsum: \n{} \n\n".format(
+            result["rouge1"], result["rouge2"], result["rougeL"], result["rougeLsum"]
         )
     )

With this change I got a score of 37.94, close to the paper's. Note that my code above reports both "rougeL" and "rougeLsum".

Question: why doesn't your code use "rougeLsum"?

https://github.com/huggingface/transformers/blob/master/examples/summarization/t5/evaluate_cnn.py#L47

I'm sorry, I'm not good at English. I hope someone kind can fix this and open a PR. Thanks.

Best,

takahiro971 commented 4 years ago

P.S. The hack above is based on commit 41a1d27cdefd6417c298518198f99e3b8431a5c0:

$ gglv
* commit 11bd4a086438b100c47e5e2b7e8696fcd67e94d1 (HEAD, master)
| Author: Takahiro Ito <65151988+takahiro971@users.noreply.github.com>
| Date:   Tue Jun 9 14:35:19 2020 +0900
| 
|     Fix a bug in score calculation
|  
* commit 41a1d27cdefd6417c298518198f99e3b8431a5c0 (origin/master, origin/HEAD)
| Author: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
| Date:   Mon Jun 8 21:22:37 2020 -0400

takahiro971 commented 4 years ago

Sorry, I accidentally closed issue ...

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.