So, I investigated Google's code, and I found that they use not "rougeL" but "rougeLsum". They also say: "# Add newlines between sentences so that rougeLsum is computed correctly."
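To illustrate the difference, here is a minimal sketch of my own (not from the repository), using the same rouge_score library: "rougeLsum" splits each text on newlines and computes a summary-level LCS over the sentences, so it only behaves differently from "rougeL" once the sentences are separated by "\n":

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL", "rougeLsum"], use_stemmer=True)

# Toy reference with the two sentences already newline-separated.
reference = "the cat sat on the mat .\nthe dog barked loudly ."

# The same prediction, once flat and once with a newline between sentences.
flat = "the dog barked loudly . the cat sat on the mat ."
split = "the dog barked loudly .\nthe cat sat on the mat ."

for name, prediction in [("flat", flat), ("split", split)]:
    scores = scorer.score(reference, prediction)  # signature is score(target, prediction)
    print(name, scores["rougeL"].fmeasure, scores["rougeLsum"].fmeasure)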
So I tried the following hack:
$ git log -p
commit 11bd4a086438b100c47e5e2b7e8696fcd67e94d1
Author: Takahiro Ito <65151988+takahiro971@users.noreply.github.com>
Date: Tue Jun 9 14:35:19 2020 +0900
Fix a bug in the score calculation
diff --git a/examples/summarization/t5/evaluate_cnn.py b/examples/summarization/t5/evaluate_cnn.py
index d2d6ee9..e1db944 100644
--- a/examples/summarization/t5/evaluate_cnn.py
+++ b/examples/summarization/t5/evaluate_cnn.py
@@ -44,17 +44,27 @@ def generate_summaries(lns, output_file_path, model_size, batch_size, device):
 
 
 def calculate_rouge(output_lns, reference_lns, score_path):
     score_file = Path(score_path).open("w")
-    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
+    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL", "rougeLsum"], use_stemmer=True)
     aggregator = scoring.BootstrapAggregator()
 
+    # copy from
+    # https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py#L80
+    def _prepare_summary(summary):
+        # Make sure the summary is not bytes-type
+        # Add newlines between sentences so that rougeLsum is computed correctly.
+        summary = summary.replace(" . ", " .\n")
+        return summary
+
     for reference_ln, output_ln in zip(reference_lns, output_lns):
+        reference_ln = _prepare_summary(reference_ln)
+        output_ln = _prepare_summary(output_ln)
         scores = scorer.score(reference_ln, output_ln)
         aggregator.add_scores(scores)
 
     result = aggregator.aggregate()
     score_file.write(
-        "ROUGE_1: \n{} \n\n ROUGE_2: \n{} \n\n ROUGE_L: \n{} \n\n".format(
-            result["rouge1"], result["rouge2"], result["rougeL"]
+        "ROUGE_1: \n{} \n\n ROUGE_2: \n{} \n\n ROUGE_L: \n{} \n\n ROUGE_Lsum: \n{} \n\n".format(
+            result["rouge1"], result["rouge2"], result["rougeL"], result["rougeLsum"]
         )
     )
With this change I got a score of 37.94, close to the paper's score. Note that my code above reports both "rougeL" and "rougeLsum".
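For reference, here is what _prepare_summary does to a tokenized, CNN/DailyMail-style summary (the input string below is a made-up example):

def _prepare_summary(summary):
    # Add newlines between sentences so that rougeLsum is computed correctly.
    return summary.replace(" . ", " .\n")

print(_prepare_summary("police arrest suspect . trial begins monday ."))
# police arrest suspect .
# trial begins monday .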
Question: Why doesn't your code use "rougeLsum"?
I'm sorry, I'm not good at English. I hope some kind person can fix this and create a PR, thanks.
Best,
P.S. The above hack is based on 41a1d27cdefd6417c298518198f99e3b8431a5c0:
$ git log --graph
* commit 11bd4a086438b100c47e5e2b7e8696fcd67e94d1 (HEAD, master)
| Author: Takahiro Ito <65151988+takahiro971@users.noreply.github.com>
| Date: Tue Jun 9 14:35:19 2020 +0900
|
| Fix a bug in the score calculation
|
* commit 41a1d27cdefd6417c298518198f99e3b8431a5c0 (origin/master, origin/HEAD)
| Author: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
| Date: Mon Jun 8 21:22:37 2020 -0400
Sorry, I accidentally closed the issue ...
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
🐛 Bug
Information
I tried summarization/t5 from the examples. ROUGE_1 and ROUGE_2 are equal to those in Google's paper, but ROUGE_L is very low!
Model I am using (Bert, XLNet ...): T5
Language I am using the model on (English, Chinese ...): English