Seahorse is a dataset for multilingual, multifaceted summarization evaluation. It contains 96K summaries with human ratings along 6 quality dimensions (comprehensibility, repetition, grammar, attribution, main ideas, and conciseness), covering 6 languages, 9 systems, and 4 datasets.
More details can be found in the paper, which can be cited as follows:
@misc{clark2023seahorse,
      title={SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation},
      author={Elizabeth Clark and Shruti Rijhwani and Sebastian Gehrmann and Joshua Maynez and Roee Aharoni and Vitaly Nikolaev and Thibault Sellam and Aditya Siddhant and Dipanjan Das and Ankur P. Parikh},
      year={2023},
      eprint={2305.13194},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
The Seahorse dataset is released under the CC-BY 4.0 license.
You can download the dataset here: https://storage.googleapis.com/seahorse-public/seahorse_data.zip
New!! We have also released the learned metrics trained on Seahorse on HuggingFace; details are in the metrics section below.
The dataset is split into 3 .tsv files: the train, validation, and test sets.
Each file contains the following columns:

- gem_id: the ID corresponding to the article that was used to generate the summary (see "Retrieving articles from GEM" below for more details)
- worker_lang: the language ID (de, es-ES, en-US, ru, tr, vi)
- summary: the generated summary
- model: the source of the summary (either reference or the summarization model)
- question1-6: 6 columns with annotator ratings, corresponding to the 6 dimensions of quality (comprehensibility, repetition, grammar, attribution, main idea(s), and conciseness). If question1 = No, then there will be no ratings for the remaining questions.

Here is an example entry:
xlsum_english-validation-6416 en-US Schools in England, Wales and Scotland are being urged to bring back overseas exchange trips. t5_base Yes Yes Yes Yes Yes Yes
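For example, the splits can be loaded with pandas. This is a minimal sketch, not part of the release: the path and file name (seahorse_data/train.tsv) and the exact column names are assumptions about the contents of the zip file, so adjust them to match the actual download.

```python
# Sketch: load one Seahorse split and filter on an annotation column.
# Assumes the unzipped archive contains train.tsv with the columns described above
# (gem_id, worker_lang, summary, model, question1..question6).
import pandas as pd

train = pd.read_csv("seahorse_data/train.tsv", sep="\t")
print(train.columns.tolist())

# Keep only summaries rated "Yes" for attribution (question4).
attributable = train[train["question4"] == "Yes"]
print(f"{len(attributable)} of {len(train)} summaries rated attributable")
```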
There is also a directory called duplicates, which contains the items that received multiple annotations. Note that this data should NOT be used for training metrics, as it may overlap with the train/dev/test sets.
If you would like to access the articles that the Seahorse summaries are based on, you will need to retrieve them using their GEM IDs.
The xsum, mlsum, and xlsum articles can all be retrieved through GEM on HuggingFace; the gem_id column points to the corresponding article in the GEM datasets.
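As a sketch of one way to do this with the HuggingFace datasets library: the dataset name GEM/xlsum, the per-language config ("english"), and the field names below are assumptions to verify against the dataset cards on HuggingFace; the same pattern applies to GEM's xsum and mlsum datasets.

```python
# Sketch: look up the source article for a Seahorse entry by its gem_id.
# Assumptions to verify on HuggingFace: the GEM/xlsum dataset, a per-language
# config named "english", and a "gem_id" field in each example.
from datasets import load_dataset

gem_id = "xlsum_english-validation-6416"  # example gem_id from the Seahorse TSV
split = gem_id.split("-")[1]              # -> "validation"

ds = load_dataset("GEM/xlsum", "english", split=split)
article = next(ex for ex in ds if ex["gem_id"] == gem_id)
print(article.keys())
```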
The wikilingua article ids come from a previous version of the GEM dataset and should be retrieved using TensorFlow datasets. Here's an example of how to load the English wikilingua dataset into a dataframe:
import tensorflow_datasets as tfds

# Load the English WikiLingua split of the older GEM dataset via the TFDS HuggingFace wrapper.
lang = 'english_en'
orig_split = 'validation'
ds, info = tfds.load(f'huggingface:gem/wiki_lingua_{lang}', split=orig_split, with_info=True)
# Convert to a pandas dataframe for easy lookup by gem_id.
hfdf = tfds.as_dataframe(ds, info)
Metrics trained on Seahorse are available through HuggingFace. These are mT5-based metrics in two sizes (Large and XXL), each trained on one of the six dimensions of quality. Please see the paper for more details about these metrics.

| | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 |
|---|---|---|---|---|---|---|
| mT5-XXL | seahorse-xxl-q1 | seahorse-xxl-q2 | seahorse-xxl-q3 | seahorse-xxl-q4 | seahorse-xxl-q5 | seahorse-xxl-q6 |
| mT5-Large | seahorse-large-q1 | seahorse-large-q2 | seahorse-large-q3 | seahorse-large-q4 | seahorse-large-q5 | seahorse-large-q6 |
Note: If you want to use the metrics for labeling summaries (as opposed to looking at correlations or ROC-AUC as we did in the paper), you will need to select a threshold value to convert the metric's continuous scores into labels, as in the sketch below. The best threshold value will depend on your data and use case.
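As an illustration only, here is one way to score a pair and apply such a threshold with the transformers library. The checkpoint name, the "premise: ... hypothesis: ..." input format, and the "0"/"1" output convention are assumptions; check the model cards on HuggingFace before relying on them.

```python
# Sketch: score an (article, summary) pair with a Seahorse metric and threshold it.
# Assumptions (verify against the model cards): a checkpoint id like
# "google/seahorse-large-q4", the "premise: ... hypothesis: ..." input format, and
# that the model's first generated token is "1" (positive) or "0" (negative).
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/seahorse-large-q4"  # assumed checkpoint id; Q4 = attribution
THRESHOLD = 0.5  # tune this on your own validation data

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def seahorse_score(article: str, summary: str) -> float:
    text = f"premise: {article} hypothesis: {summary}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=1,
                             output_scores=True, return_dict_in_generate=True)
    # Probability of the "1" token at the first decoding step.
    probs = torch.softmax(out.scores[0][0], dim=-1)
    one_id = tokenizer("1", add_special_tokens=False).input_ids[0]  # may need adjusting for the tokenizer
    return probs[one_id].item()

score = seahorse_score("Some source article ...", "A candidate summary ...")
print(score, "positive" if score >= THRESHOLD else "negative")
```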
We are maintaining a leaderboard with official results on our test set.
We ask that you not incorporate any part of the Seahorse validation set into your training data and only use it for validation/hyperparameter tuning, as development sets are typically used.
We report results on two metrics: Pearson correlation ($\rho$) and area under the ROC curve (roc); a sketch of how to compute both is given after the table.

| Model | Link | Q1 $\rho$ | Q1 roc | Q2 $\rho$ | Q2 roc | Q3 $\rho$ | Q3 roc | Q4 $\rho$ | Q4 roc | Q5 $\rho$ | Q5 roc | Q6 $\rho$ | Q6 roc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mT5-seahorse | [Clark et al. 2023] | 0.52 | 0.90 | 0.86 | 0.98 | 0.45 | 0.84 | 0.59 | 0.85 | 0.50 | 0.80 | 0.52 | 0.81 |
| mT5-XNLI | [Honovich et al. 2022, Conneau et al. 2018] | - | - | - | - | - | - | 0.43 | 0.78 | - | - | - | - |
| ROUGE-L | [Lin et al. 2004] | 0.04 | 0.54 | 0.06 | 0.54 | -0.03 | 0.43 | 0.13 | 0.55 | 0.03 | 0.54 | 0.02 | 0.54 |
| Majority Class | - | - | 0.5 | - | 0.5 | - | 0.5 | - | 0.5 | - | 0.5 | - | 0.5 |
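For reference, a minimal sketch of how these two numbers can be computed for one question, using scipy and scikit-learn. The arrays below are placeholder data, and the human ratings are assumed to be binarized (Yes = 1, No = 0); this is an illustration, not the official evaluation script.

```python
# Sketch: compute Pearson correlation and ROC-AUC for one question, given a metric's
# continuous scores and binarized human ratings (Yes -> 1, No -> 0).
# The lists below are placeholder data for illustration only.
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

human_labels = [1, 0, 1, 1, 0]             # gold ratings for, e.g., question4
metric_scores = [0.9, 0.2, 0.7, 0.6, 0.4]  # your metric's scores for the same summaries

rho, _ = pearsonr(metric_scores, human_labels)
roc = roc_auc_score(human_labels, metric_scores)
print(f"rho={rho:.2f} roc={roc:.2f}")
```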
If you want to submit to the leaderboard, please send an email to the contact email below with your results.
Please email seahorse-authors@google.com if you have any questions about the dataset.