parthsarthi03 / raptor

The official implementation of RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
https://arxiv.org/abs/2401.18059
MIT License

Confusion about the BLEU-4 of NarrativeQA #50

Open yyTraveler opened 2 months ago

yyTraveler commented 2 months ago

I tried the experiment following the methods in the appendix, but my BLEU-4 comes out much higher than the number reported in the paper.

Using the metrics code below, I get the scores in the table that follows. Is there anything else special about how BLEU-4 is calculated?

BLEU-1  BLEU-4  METEOR  ROUGE-L
0.21    0.10    0.17    0.31
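
One possible source of such a gap (an assumption on my side, not something confirmed by the paper): NarrativeQA reference answers are only a few words long, so sentence-level BLEU-4 is very sensitive to the choice of smoothing function and tokenization. With SmoothingFunction().method4, a short answer gets partial credit even when it shares no 4-gram with the reference, whereas without smoothing the same pair scores essentially zero. A toy check (made-up answer pair, just to illustrate the effect):

import nltk
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

nltk.download("punkt", quiet=True)

# Hypothetical short QA pair, typical of NarrativeQA answer lengths.
prediction = word_tokenize("He was killed by the count")
reference = [word_tokenize("The count killed him")]

weights = (0.25, 0.25, 0.25, 0.25)  # BLEU-4

# No smoothing: effectively zero, since there is no shared 4-gram (NLTK warns).
print(sentence_bleu(reference, prediction, weights=weights))

# method4 smoothing backs off for the missing n-gram orders, so the score is noticeably higher.
print(sentence_bleu(reference, prediction, weights=weights,
                    smoothing_function=SmoothingFunction().method4))

So if the paper's evaluation used a different smoothing method, corpus-level BLEU, or different tokenization or lowercasing, that alone could account for a sizeable difference.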
yyTraveler commented 2 months ago

The metrics code is here, adapted from AllenNLP:


import copy

import nltk
import rouge  # py-rouge package
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

# Download the NLTK resources needed for tokenization (punkt) and METEOR (wordnet).
for resource, path in [("punkt", "tokenizers/punkt"), ("wordnet", "corpora/wordnet")]:
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(resource)

# ROUGE-L evaluator: truncate texts to 100 words, apply stemming, report F/P/R.
rouge_l_evaluator = rouge.Rouge(
    metrics=["rouge-l"],
    max_n=4,
    limit_length=True,
    length_limit=100,
    length_limit_type="words",
    apply_avg=True,
    apply_best=True,
    alpha=0.5,
    weight_factor=1.2,
    stemming=True,
)

def bleu_1(p, g):
    # p: tokenized prediction, g: list of tokenized references.
    smoothie = SmoothingFunction().method4
    return sentence_bleu(g, p, weights=(1, 0, 0, 0), smoothing_function=smoothie)

def bleu_4(p, g):
    smoothie = SmoothingFunction().method4
    return sentence_bleu(g, p, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smoothie)

def meteor(p, g):
    # Whitespace-split variant; kept for reference, not used below.
    return meteor_score([x.split() for x in g], p.split())

def meteor_with_tokenize(p: str, g: str):
    # METEOR with NLTK word tokenization.
    pp = word_tokenize(p)
    gg = [word_tokenize(g)]
    return meteor_score(gg, pp)

def rouge_l(p, g):
    return rouge_l_evaluator.get_scores(p, g)

def metric_max_over_ground_truths(metric_fn, prediction, ground_truths, tokenize=False):
    # Score the prediction against every reference answer and keep the best score.
    scores_for_ground_truths = []
    for ground_truth in ground_truths:
        if tokenize:
            score = metric_fn(word_tokenize(prediction), [word_tokenize(ground_truth)])
        else:
            score = metric_fn(prediction, ground_truth)
        scores_for_ground_truths.append(score)
    # ROUGE returns a nested dict, so take the max of each field separately.
    if isinstance(scores_for_ground_truths[0], dict) and "rouge-l" in scores_for_ground_truths[0]:
        max_score = copy.deepcopy(scores_for_ground_truths[0])
        max_score["rouge-l"]["f"] = round(
            max(s["rouge-l"]["f"] for s in scores_for_ground_truths), 2
        )
        max_score["rouge-l"]["p"] = round(
            max(s["rouge-l"]["p"] for s in scores_for_ground_truths), 2
        )
        max_score["rouge-l"]["r"] = round(
            max(s["rouge-l"]["r"] for s in scores_for_ground_truths), 2
        )
        return max_score
    return round(max(scores_for_ground_truths), 2)

def get_metric_score(prediction, ground_truths):
    bleu_1_score = metric_max_over_ground_truths(bleu_1, prediction, ground_truths, tokenize=True)
    bleu_4_score = metric_max_over_ground_truths(bleu_4, prediction, ground_truths, tokenize=True)
    meteor_result = metric_max_over_ground_truths(meteor_with_tokenize, prediction, ground_truths, tokenize=False)
    rouge_l_score = metric_max_over_ground_truths(rouge_l, prediction, ground_truths, tokenize=False)

    return (
        bleu_1_score,
        bleu_4_score,
        meteor_result,
        rouge_l_score["rouge-l"]["f"],
        rouge_l_score["rouge-l"]["p"],
        rouge_l_score["rouge-l"]["r"],
    )
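
For completeness, a minimal driver that averages these per-question scores over a set of QA pairs (my own loop, not taken from the repo; the item fields "prediction" and "answers" are assumed placeholders):

# Aggregation sketch: average each metric over all QA pairs.
# `qa_items` is assumed to be a list of dicts like
# {"prediction": "<model answer>", "answers": ["<ref 1>", "<ref 2>"]}.
def evaluate(qa_items):
    totals = [0.0] * 6
    for item in qa_items:
        scores = get_metric_score(item["prediction"], item["answers"])
        totals = [t + s for t, s in zip(totals, scores)]
    names = ["bleu-1", "bleu-4", "meteor", "rouge-l-f", "rouge-l-p", "rouge-l-r"]
    return {name: total / len(qa_items) for name, total in zip(names, totals)}

# Example call with a single made-up item:
# print(evaluate([{"prediction": "The count", "answers": ["The count killed him"]}]))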

kimoji919 commented 2 months ago

Hello, I am confused about the experimental setting for NarrativeQA. Which should I use as the input for prediction: the summary or the full document?

yyTraveler commented 2 months ago

You should read the paper and code carefully: summarization is used when building the tree, and question answering happens at the LLM answering step.
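
To make that split concrete, here is a minimal sketch using the high-level interface shown in this repo's README (RetrievalAugmentation with add_documents and answer_question); the file path and question are made up, and if the current API differs, treat the method names as assumptions:

# Stage 1: tree building -- the full story text is chunked, embedded, clustered,
# and recursively summarized into tree nodes.
from raptor import RetrievalAugmentation

RA = RetrievalAugmentation()
with open("story_full_text.txt") as f:  # hypothetical path to one NarrativeQA story
    RA.add_documents(f.read())

# Stage 2: QA -- retrieve relevant nodes (leaf chunks and summaries) and let the LLM answer.
print(RA.answer_question(question="Who killed the count?"))  # hypothetical question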

kimoji919 commented 2 months ago

I think you may have misunderstood me. I want to know how this dataset is specifically used in this paper: is the raw data the full-text part of the dataset or the summary part? I noticed that on page six, when introducing the datasets, the paper says NarrativeQA is based on question-answer pairs over the full text of books and movie scripts. The paper also segments text into 100-token chunks, while the summaries in the original dataset are roughly 600-900 tokens, so I think it must be the full text rather than the summaries. I am looking for a general way to use this dataset in the LLM era, so I may not be paying attention to certain technical details of this particular paper, only the dataset usage. I understand this paper as a form of structured hierarchical retrieval: when building the tree, nodes are constructed from chunks of the full text, and then retrieval is performed for QA. So what you mean is that the full-text data is still used, but in this paper the raw data is processed into node-wise summaries and QA is performed on the retrieved nodes, right?


yyTraveler commented 2 months ago

Yes, it's always the full text in this paper.

Quoting the experimental section of the paper:

The NarrativeQA-Story task requires a comprehensive understanding of the entire narrative in order to accurately answer its questions, thus testing the model’s ability to comprehend longer texts in the literary domain.

ET-yzk commented 2 months ago

A beginner question about the experiments:

  1. Which part of the dataset should be used to reproduce the reported results?
  2. When building the tree, is one tree built from the full texts of all questions together, or is a separate tree built for each question?

I looked through the paper and couldn't find where this is stated.