huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

ROUGE_L score of summarization/t5 is much lower than the paper's. #4860

Closed takahiro971 closed 4 years ago

takahiro971 commented 4 years ago

🐛 Bug

Information

I tried summarization/t5 from the examples. ROUGE_1 and ROUGE_2 are close to the numbers in Google's paper, but ROUGE_L alone is much lower!

ROUGE_1: paper=41.12 | my result=40.48 (almost equal)
ROUGE_2: paper=19.56 | my result=18.59 (almost equal)
ROUGE_L: paper=38.35 | my result=28.22 (much lower!)

Model I am using (Bert, XLNet ...): T5

Language I am using the model on (English, Chinese ...): English

takahiro971 commented 4 years ago

So I investigated Google's code and found this:

https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py#L76

I think they use "rougeLsum", not "rougeL". They also note: "# Add newlines between sentences so that rougeLsum is computed correctly."

https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py#L82
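For context, here is my own simplified illustration (not the actual rouge_score implementation): "rougeL" runs a single LCS over the whole summary, while "rougeLsum" works sentence by sentence after splitting on newlines, which is why the missing newlines hurt the score so much. The `rouge_lsum_sketch` helper below is hypothetical and greedy; the real metric uses a union-LCS, but the idea is the same.

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]


def f1(lcs, n_ref, n_cand):
    if lcs == 0:
        return 0.0
    p, r = lcs / n_cand, lcs / n_ref
    return 2 * p * r / (p + r)


def rouge_l(reference, candidate):
    # "rougeL": one LCS over the whole summaries as flat token sequences.
    ref, cand = reference.split(), candidate.split()
    return f1(lcs_len(ref, cand), len(ref), len(cand))


def rouge_lsum_sketch(reference, candidate):
    # Simplified "rougeLsum": split on newlines, take the best
    # per-sentence LCS for each reference sentence.
    ref_sents = [s.split() for s in reference.split("\n")]
    cand_sents = [s.split() for s in candidate.split("\n")]
    total = sum(max(lcs_len(r, c) for c in cand_sents) for r in ref_sents)
    n_ref = sum(len(s) for s in ref_sents)
    n_cand = sum(len(s) for s in cand_sents)
    return f1(total, n_ref, n_cand)


ref = "the cat sat .\nthe dog ran ."
cand = "the dog ran .\nthe cat sat ."
print(rouge_l(ref.replace("\n", " "), cand.replace("\n", " ")))  # 0.5
print(rouge_lsum_sketch(ref, cand))  # 1.0
```

With the sentence order swapped, the whole-sequence LCS drops while the sentence-level score stays high. That matches the gap I see between my rougeL and rougeLsum numbers.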

So I tried the following hack:

$ g log -p
commit 11bd4a086438b100c47e5e2b7e8696fcd67e94d1
Author: Takahiro Ito <65151988+takahiro971@users.noreply.github.com>
Date:   Tue Jun 9 14:35:19 2020 +0900

    Fix a bug in score calculation

diff --git a/examples/summarization/t5/evaluate_cnn.py b/examples/summarization/t5/evaluate_cnn.py
index d2d6ee9..e1db944 100644
--- a/examples/summarization/t5/evaluate_cnn.py
+++ b/examples/summarization/t5/evaluate_cnn.py
@@ -44,17 +44,27 @@ def generate_summaries(lns, output_file_path, model_size, batch_size, device):

 def calculate_rouge(output_lns, reference_lns, score_path):
     score_file = Path(score_path).open("w")
-    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
+    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL", "rougeLsum"], use_stemmer=True)
     aggregator = scoring.BootstrapAggregator()

+    # copy from
+    # https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py#L80
+    def _prepare_summary(summary):
+        # Make sure the summary is not bytes-type
+        # Add newlines between sentences so that rougeLsum is computed correctly.
+        summary = summary.replace(" . ", " .\n")
+        return summary
+
     for reference_ln, output_ln in zip(reference_lns, output_lns):
+        reference_ln = _prepare_summary(reference_ln)
+        output_ln = _prepare_summary(output_ln)
         scores = scorer.score(reference_ln, output_ln)
         aggregator.add_scores(scores)

     result = aggregator.aggregate()
     score_file.write(
-        "ROUGE_1: \n{} \n\n ROUGE_2: \n{} \n\n ROUGE_L: \n{} \n\n".format(
-            result["rouge1"], result["rouge2"], result["rougeL"]
+        "ROUGE_1: \n{} \n\n ROUGE_2: \n{} \n\n ROUGE_L: \n{} \n\n ROUGE_Lsum: \n{} \n\n".format(
+            result["rouge1"], result["rouge2"], result["rougeL"], result["rougeLsum"]
         )
     )

With this change I got a score of 37.94, close to the paper's. Note that my code above reports both "rougeL" and "rougeLsum".

Question: why doesn't your code use "rougeLsum"?

https://github.com/huggingface/transformers/blob/master/examples/summarization/t5/evaluate_cnn.py#L47

I'm sorry, I'm not good at English. I hope someone kind can fix this and open a PR. Thanks.

Best,

takahiro971 commented 4 years ago

P.S. The hack above is based on commit 41a1d27cdefd6417c298518198f99e3b8431a5c0:

$ gglv
* commit 11bd4a086438b100c47e5e2b7e8696fcd67e94d1 (HEAD, master)
| Author: Takahiro Ito <65151988+takahiro971@users.noreply.github.com>
| Date:   Tue Jun 9 14:35:19 2020 +0900
| 
|     Fix a bug in score calculation
|  
* commit 41a1d27cdefd6417c298518198f99e3b8431a5c0 (origin/master, origin/HEAD)
| Author: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
| Date:   Mon Jun 8 21:22:37 2020 -0400

takahiro971 commented 4 years ago

Sorry, I accidentally closed issue ...

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.