google-research-datasets / seahorse

Seahorse is a dataset for multilingual, multi-faceted summarization evaluation. It consists of 96K summaries with human ratings along 6 quality dimensions: comprehensibility, repetition, grammar, attribution, main idea(s), and conciseness, covering 6 languages, 9 systems and 4 datasets.

The input length for the learnt metrics model #3

Open ArtemBiliksin opened 1 month ago

ArtemBiliksin commented 1 month ago

Hello!

Thanks for releasing the dataset!

The paper in Appendix A states that the input sequence length for the learned metrics model is 2048. How is 2048 distributed between article and summary in the input sequence "premise: {article} hypothesis: {summary}"? For example, the "premise: {article}" part is no more than 1024 tokens and the "hypothesis: {summary}" part is no more than 1024 tokens. Another example: the "premise: {article}" part is no more than 1536 tokens and the "hypothesis: {summary}" part is no more than 512 tokens (1536 + 512 = 2048). How was the possible number of tokens divided into two parts?

In the test dataset there is an article and its corresponding summary for which the input sequence "premise: {article} hypothesis: {summary}" has more than 2048 tokens. Such examples are 4.7% of the entire test dataset.

eaclark07 commented 1 month ago

Hi Artem--For the learned metric experiments in the paper, the premise and hypothesis are concatenated together and truncated to fit the 2048 token length. As you point out, this will truncate important information for some examples, so smarter truncation strategies are worth exploring, e.g., dynamically truncating the end of the article to fit the full summary into the sequence length.
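
As a minimal sketch of that idea (not the procedure used in the paper, and assuming a Hugging Face T5-style tokenizer), the summary could be encoded first and the article truncated to whatever budget remains:

import torch

def build_input(tokenizer, article, summary, max_length=2048):
    # Encode the summary part first so it is never truncated away.
    summary_ids = tokenizer.encode("hypothesis: " + summary, add_special_tokens=False)
    # Whatever budget remains (minus one slot for the EOS token) goes to the article.
    budget = max(max_length - len(summary_ids) - 1, 0)
    article_ids = tokenizer.encode("premise: " + article, add_special_tokens=False)[:budget]
    return torch.tensor([article_ids + summary_ids + [tokenizer.eos_token_id]])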

ArtemBiliksin commented 1 month ago

Thanks for the reply!

I tried to reproduce the Q4 metric values from Table 6 using the google/seahorse-large-q4 model, but I was not able to match the numbers reported in the paper exactly. I describe my setup and results in more detail below.

Data: I downloaded the test dataset from SEAHORSE. I recovered the articles from the GEM benchmark: mlsum, xsum, xlsum, wiki_lingua (important: you must select the right version of wiki_lingua, the one available at SHA1=b864b63...). I kept only the samples whose Q4 value is "Yes" or "No", as described in Appendix A of the paper.
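
For reference, the filtering step looks roughly like the sketch below. The file name and the column names (gem_id, question4) are assumptions and should be adjusted to the actual released schema:

import pandas as pd

test = pd.read_csv("seahorse_test.tsv", sep="\t")    # assumed file name
q4 = test[test["question4"].isin(["Yes", "No"])].copy()
q4["label"] = (q4["question4"] == "Yes").astype(int)
# The articles themselves are not in the TSV; they are joined in from the GEM
# datasets (mlsum, xsum, xlsum, wiki_lingua) via the gem_id field.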

Model: google/seahorse-large-q4.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/seahorse-large-q4")
model = AutoModelForSeq2SeqLM.from_pretrained("google/seahorse-large-q4")
model.eval()

Model input: The article and summary are prefixed with the tags "premise:" and "hypothesis:", i.e. prompt = "premise: {} hypothesis: {}", and the result is then truncated to 2048 tokens as you described above. The implementation is below:

inputs = tokenizer.encode(prompt.format(article, summary), max_length=2048, truncation=True, return_tensors="pt")

It is worth noting that with this truncation scheme, samples whose article alone exceeds 2048 tokens enter the model with no summary tokens at all. This strongly affects the model predictions.
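
A quick way to quantify this (a sketch; here articles is assumed to be the list of source documents recovered in the data step above):

def summary_is_dropped(tokenizer, article, max_length=2048):
    # If the "premise: {article}" part alone fills the budget, simple truncation
    # removes every summary token.
    article_ids = tokenizer.encode("premise: " + article, add_special_tokens=False)
    return len(article_ids) >= max_length - 1    # -1 for the EOS token

n_dropped = sum(summary_is_dropped(tokenizer, a) for a in articles)
print(f"{n_dropped} test examples reach the model with no summary tokens at all")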

Generate: Next, I generate a single token and get a probability estimate for class "1":

outputs = model.generate(inputs, max_new_tokens=1, output_logits=True, return_dict_in_generate=True)
logits = outputs.logits
logits_for_class_token = logits[0]    # logits for the single generated token
probs_for_class_token = logits_for_class_token.softmax(-1)
prob1 = probs_for_class_token[0, 333]    # 333 is the vocabulary id of the "1" token
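Instead of hard-coding 333, the id could also be looked up from the tokenizer (a sketch; it assumes the positive label is stored as the single vocabulary piece "1"):

one_id = tokenizer.convert_tokens_to_ids("1")    # assumed to equal 333 for this model
prob1 = probs_for_class_token[0, one_id]
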
Metric: I collect the probability estimates in the list y_score. The true labels are collected in the list y_true, replacing "Yes" with 1 and "No" with 0. My results for Q4 are tabulated below:

| Metric | My Q4 result | Q4 result from the paper |
| --- | --- | --- |
| $\rho$ | 0.513 | 0.55 |
| ROC-AUC | 0.80 | 0.82 |

The metrics were calculated as follows:

import numpy
import sklearn.metrics

rho = numpy.corrcoef(y_true, y_score)[0, 1]    # Pearson correlation
roc = sklearn.metrics.roc_auc_score(y_true, y_score)

I followed the same procedure for Q5. The results are in the table below:

| Metric | My Q5 result | Q5 result from the paper |
| --- | --- | --- |
| $\rho$ | 0.414 | 0.46 |
| ROC-AUC | 0.753 | 0.78 |

Questions:

  1. Am I doing the truncation correctly? Is this the truncation you used in the paper?
  2. Could it be that the checkpoint on Hugging Face differs from the checkpoint used in the paper?
  3. Could you share the script you used to evaluate the metrics for Q4?