Seahorse is a dataset for multilingual, multifaceted summarization evaluation. It contains 96K summaries with human ratings along 6 quality dimensions (comprehensibility, repetition, grammar, attribution, main ideas, and conciseness), covering 6 languages, 9 systems, and 4 datasets.
More details can be found in the paper, which can be cited as follows:
@misc{clark2023seahorse,
      title={SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation},
      author={Elizabeth Clark and Shruti Rijhwani and Sebastian Gehrmann and Joshua Maynez and Roee Aharoni and Vitaly Nikolaev and Thibault Sellam and Aditya Siddhant and Dipanjan Das and Ankur P. Parikh},
      year={2023},
      eprint={2305.13194},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
The Seahorse dataset is released under the CC-BY 4.0 license.
You can download the dataset here: https://storage.googleapis.com/seahorse-public/seahorse_data.zip
New!! We have also released the learned metrics trained on Seahorse on HuggingFace; details are in the metrics section below.
The dataset is split into 3 .tsv files: the train, validation, and test sets.
Each file contains the following columns:

- gem_id: the ID corresponding to the article that was used to generate the summary (see "Retrieving articles from GEM" below for more details)
- worker_lang: the language ID (de, es-ES, en-US, ru, tr, vi)
- summary: the generated summary
- model: the source of the summary (either reference or the summarization model)
- question1-6: 6 columns with annotator ratings, corresponding to the 6 dimensions of quality (comprehensibility, repetition, grammar, attribution, main idea(s), and conciseness). If question1 = No, then there will be no ratings for the remaining questions.

Here is an example entry:
xlsum_english-validation-6416 en-US Schools in England, Wales and Scotland are being urged to bring back overseas exchange trips. t5_base Yes Yes Yes Yes Yes Yes
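For example, the splits can be loaded with pandas. This is a minimal sketch, not part of the release: the path and file name (seahorse_data/train.tsv) and the exact column names are assumptions about the contents of the zip file, so adjust them to match the actual download.

```python
# Sketch: load one Seahorse split and filter on an annotation column.
# Assumes the unzipped archive contains train.tsv with the columns described above
# (gem_id, worker_lang, summary, model, question1..question6).
import pandas as pd

train = pd.read_csv("seahorse_data/train.tsv", sep="\t")
print(train.columns.tolist())

# Keep only summaries rated "Yes" for attribution (question4).
attributable = train[train["question4"] == "Yes"]
print(f"{len(attributable)} of {len(train)} summaries rated attributable")
```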
There is also a directory called duplicates, which contains the items that received multiple annotations. Note that this data should NOT be used for training metrics, as it may overlap with the train/dev/test sets.
If you would like to access the articles that the Seahorse summaries are based on, you will need to retrieve them using their GEM IDs.
The xsum, mlsum, and xlsum articles can all be retrieved through GEM on HuggingFace; the gem_id column points to the corresponding article in the GEM datasets.
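As a sketch of one way to do this with the HuggingFace datasets library: the dataset name GEM/xlsum, the per-language config ("english"), and the field names below are assumptions to verify against the dataset cards on HuggingFace; the same pattern applies to GEM's xsum and mlsum datasets.

```python
# Sketch: look up the source article for a Seahorse entry by its gem_id.
# Assumptions to verify on HuggingFace: the GEM/xlsum dataset, a per-language
# config named "english", and a "gem_id" field in each example.
from datasets import load_dataset

gem_id = "xlsum_english-validation-6416"  # example gem_id from the Seahorse TSV
split = gem_id.split("-")[1]              # -> "validation"

ds = load_dataset("GEM/xlsum", "english", split=split)
article = next(ex for ex in ds if ex["gem_id"] == gem_id)
print(article.keys())
```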
The wikilingua article ids come from a previous version of the GEM dataset and should be retrieved using TensorFlow datasets. Here's an example of how to load the English wikilingua dataset into a dataframe:
import tensorflow_datasets as tfds

# Load the English WikiLingua split of the older GEM dataset via the TFDS HuggingFace wrapper.
lang = 'english_en'
orig_split = 'validation'
ds, info = tfds.load(f'huggingface:gem/wiki_lingua_{lang}', split=orig_split, with_info=True)
# Convert to a pandas dataframe for easy lookup by gem_id.
hfdf = tfds.as_dataframe(ds, info)
Metrics trained on Seahorse are available through HuggingFace. These are mT5-based metrics in two sizes (Large and XXL), each trained on one of the six dimensions of quality. Please see the paper for more details about these metrics.

| | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 |
|---|---|---|---|---|---|---|
| mT5-XXL | seahorse-xxl-q1 | seahorse-xxl-q2 | seahorse-xxl-q3 | seahorse-xxl-q4 | seahorse-xxl-q5 | seahorse-xxl-q6 |
| mT5-Large | seahorse-large-q1 | seahorse-large-q2 | seahorse-large-q3 | seahorse-large-q4 | seahorse-large-q5 | seahorse-large-q6 |
Note: If you want to use the metrics for labeling summaries (as opposed to looking at correlations or ROC-AUC as we did in the paper), you will need to select a threshold value to convert the metric's continuous scores into labels, as in the sketch below. The best threshold value will depend on your data and use case.
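As an illustration only, here is one way to score a pair and apply such a threshold with the transformers library. The checkpoint name, the "premise: ... hypothesis: ..." input format, and the "0"/"1" output convention are assumptions; check the model cards on HuggingFace before relying on them.

```python
# Sketch: score an (article, summary) pair with a Seahorse metric and threshold it.
# Assumptions (verify against the model cards): a checkpoint id like
# "google/seahorse-large-q4", the "premise: ... hypothesis: ..." input format, and
# that the model's first generated token is "1" (positive) or "0" (negative).
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/seahorse-large-q4"  # assumed checkpoint id; Q4 = attribution
THRESHOLD = 0.5  # tune this on your own validation data

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def seahorse_score(article: str, summary: str) -> float:
    text = f"premise: {article} hypothesis: {summary}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=1,
                             output_scores=True, return_dict_in_generate=True)
    # Probability of the "1" token at the first decoding step.
    probs = torch.softmax(out.scores[0][0], dim=-1)
    one_id = tokenizer("1", add_special_tokens=False).input_ids[0]  # may need adjusting for the tokenizer
    return probs[one_id].item()

score = seahorse_score("Some source article ...", "A candidate summary ...")
print(score, "positive" if score >= THRESHOLD else "negative")
```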
We are maintaining a leaderboard with official results on our test set.
We ask that you not incorporate any part of the Seahorse validation set into your training data and only use it for validation/hyperparameter tuning, as development sets are typically used.
We report results on two metrics: Pearson correlation ($\rho$) and area under the ROC curve (roc); a sketch of how to compute both is given after the table.

| Model | Link | Q1 $\rho$ | Q1 roc | Q2 $\rho$ | Q2 roc | Q3 $\rho$ | Q3 roc | Q4 $\rho$ | Q4 roc | Q5 $\rho$ | Q5 roc | Q6 $\rho$ | Q6 roc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mT5-seahorse | [Clark et al. 2023] | 0.52 | 0.90 | 0.86 | 0.98 | 0.45 | 0.84 | 0.59 | 0.85 | 0.50 | 0.80 | 0.52 | 0.81 |
| mT5-XNLI | [Honovich et al. 2022, Conneau et al. 2018] | - | - | - | - | - | - | 0.43 | 0.78 | - | - | - | - |
| ROUGE-L | [Lin et al. 2004] | 0.04 | 0.54 | 0.06 | 0.54 | -0.03 | 0.43 | 0.13 | 0.55 | 0.03 | 0.54 | 0.02 | 0.54 |
| Majority Class | - | - | 0.5 | - | 0.5 | - | 0.5 | - | 0.5 | - | 0.5 | - | 0.5 |
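For reference, a minimal sketch of how these two numbers can be computed for one question, using scipy and scikit-learn. The arrays below are placeholder data, and the human ratings are assumed to be binarized (Yes = 1, No = 0); this is an illustration, not the official evaluation script.

```python
# Sketch: compute Pearson correlation and ROC-AUC for one question, given a metric's
# continuous scores and binarized human ratings (Yes -> 1, No -> 0).
# The lists below are placeholder data for illustration only.
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

human_labels = [1, 0, 1, 1, 0]             # gold ratings for, e.g., question4
metric_scores = [0.9, 0.2, 0.7, 0.6, 0.4]  # your metric's scores for the same summaries

rho, _ = pearsonr(metric_scores, human_labels)
roc = roc_auc_score(human_labels, metric_scores)
print(f"rho={rho:.2f} roc={roc:.2f}")
```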
If you want to submit to the leaderboard, please send an email to the contact email below with your results.
Please email seahorse-authors@google.com if you have any questions about the dataset.