jaehwanj6 / Fact-Verification


Study Notes on Fact Verification Datasets #1

Open jaehwanj6 opened 3 years ago

jaehwanj6 commented 3 years ago

FEVER: a large-scale dataset for Fact Extraction and Verification

source: https://arxiv.org/abs/1803.05355

Abstract

Each claim —classify—> (Supported, Refuted, Not enough info)


For Supported and Refuted, evidence is also provided

Introduction

For Textual Entailment / NLI, the paragraph containing the evidence is explicitly provided. On the other hand, Textual Claim Verification searches through a vast corpus to identify the evidence.

For Question Answering, the question contains information needed to identify the answer, but information missing from a claim can often be crucial in retrieving refuting evidence.

Pipeline:

  1. Identify relevant documents
  2. Identify the sentences forming the evidence from the documents
  3. Classify the claim w.r.t. the evidence

Claim Generation

Processed the 2017/06 Wikipedia dump with Stanford CoreNLP. Sampled sentences from the introductory sections of 50,000 popular pages.

Randomly choose a sentence —> make annotators generate a set of claims about the page’s subject

Tradeoff: simple generation —> paraphrases, too easy vs. using world knowledge —> too hard. Solution: a “dictionary” providing the list of terms hyperlinked in the original sentence together with the first sentence of their respective Wikipedia entries.

Mutations were also generated: paraphrasing, negation, substitution of entity with a similar/dissimilar one, make claim more general/specific

Annotators had trouble generating non-trivial negation mutations (e.g., just inserting “not”); the authors tried to discourage this behavior.

Claim Labeling

(Supported, Refuted, Not Enough Info (the claim was too general/specific to be supported or refuted))

Questions for annotators:

  1. Can the claim’s validity be ascertained solely from the given evidence?
  2. If not, what extra information should be added to the dictionary?

Annotators were allowed to add other URLs to support claims and were encouraged to keep the time spent under 2 minutes per claim.

Data Validation

Validation of claim generation: done implicitly in the claim-labeling process. Three ways to validate the claim-labeling process:

  1. 5-way inter-annotator agreement (Fleiss kappa of 0.6841; see the sketch after this list)
  2. super-annotators (1% of the data, experts with unlimited time to verify; precision 95.42%, recall 72.36%)
  3. manual validation by the authors (227 examples; 91.2% were found to be correctly annotated)
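
As a quick reference for how a 5-way agreement number like this can be computed, here is a minimal sketch using statsmodels’ `fleiss_kappa`; the ratings below are toy data, not the actual FEVER annotations.

```python
# Minimal sketch: Fleiss' kappa for 5 annotators over 3 labels (toy data).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

LABELS = ["SUPPORTED", "REFUTED", "NOT ENOUGH INFO"]

# rows = claims, columns = the 5 annotators' label indices (0/1/2)
ratings = np.array([
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [2, 2, 0, 2, 2],
    [0, 1, 0, 0, 0],
])

# aggregate_raters converts per-rater labels into per-category counts
counts, _ = aggregate_raters(ratings)
print(f"Fleiss kappa: {fleiss_kappa(counts):.4f}")
```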

Baseline System Description

Pipeline: document retrieval —> sentence selection —> recognizing textual entailment

  1. Document Retrieval: DrQA system (return k nearest documents for a query using cosine similarity between binned unigram and bigram TF-IDF vectors)
  2. Sentence Selection: rank sentences similar to the claim based on TF-IDF (a minimal sketch of steps 1–2 follows after this list)
  3. Recognizing Textual Entailment: Decomposable attention model between the claim and the evidence passage
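
A minimal sketch of the retrieval and sentence-selection steps using plain TF-IDF cosine similarity with scikit-learn; the corpus, claim, and hyperparameters are illustrative placeholders, and DrQA’s actual implementation uses hashed/binned bigram features rather than this exact vectorizer.

```python
# Sketch of steps 1-2 (document retrieval + sentence selection) with plain
# TF-IDF cosine similarity; an illustrative stand-in for the DrQA-based baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k(query: str, texts: list[str], k: int) -> list[str]:
    """Return the k texts most similar to the query under TF-IDF cosine."""
    vec = TfidfVectorizer(ngram_range=(1, 2))          # unigrams + bigrams
    matrix = vec.fit_transform(texts + [query])
    sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    ranked = sims.argsort()[::-1][:k]
    return [texts[i] for i in ranked]

# Toy corpus: one string per Wikipedia page (placeholder content).
pages = {
    "Colombia": "Colombia is a country in South America. Its capital is Bogota.",
    "Canada": "Canada is a country in North America. Its capital is Ottawa.",
}

claim = "The capital of Colombia is Bogota."
docs = top_k(claim, list(pages.values()), k=1)          # document retrieval
sentences = [s for d in docs for s in d.split(". ")]    # naive sentence split
evidence = top_k(claim, sentences, k=2)                 # sentence selection
print(evidence)
```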

NotEnoughInfo claims have no corresponding evidence —> error in the RTE input format. Solution: sample a sentence from the nearest page, or sample a sentence at random.

Best performance: 31.87% FEVER score for the combined verification & evidence detection task; 50.91% label accuracy for verification only.

The most challenging part is evidence identification

jaehwanj6 commented 3 years ago

DanFEVER

source: http://www.derczynski.com/papers/danfever.pdf

First non-English dataset on the FEVER task (*manually created by 4 people). EnFEVER = English FEVER; DanFEVER = Danish FEVER.

Can be used for multilingual fact verification systems (?)

DanFEVER claims generated under the same guideline as EnFEVER


6,407 claims were made, each classified as Supported / Refuted / NotEnoughInfo. Evidence for NEI was randomly sampled.

Data sources for claims: Danish Wikipedia (80%) & Den Store Danske (20%), an online Danish encyclopedia. The Wikipedia data used for claim generation is publicly available, but the Den Store Danske data is not.

Two teams of two people independently sampled evidence; the Fleiss kappa score was 0.75 for one team and 0.82 for the other, measured on a reduced subset.

Strategies for gathering texts for claims: (1) pages on well-known topics were selected from Wikipedia’s starred articles and Den Store Danske; (2) Wikipedia entities with abstracts were selected at random to ensure a broad spectrum of topics.

Common 3-step strategy for FEVER: Document Retrieval, Sentence Retrieval, Recognizing Textual Entailment. DanFEVER provides a baseline only for RTE.


Although the DanFEVER dataset is small, models trained on DanFEVER show performance comparable to models trained on EnFEVER (F1 = 73% for an MLP, F1 = 88% for a Decomposable Attention network).

*Does not provide baseline result for the entire pipeline

jaehwanj6 commented 3 years ago

Adversarial Attacks against Fact Extraction and Verification

source: https://arxiv.org/abs/1903.05543

Abstract

A simple method of generating entailment-preserving and entailment-altering perturbations of instances based on common patterns within the training data.

Introduction

When the model is evaluated on data outside of the distribution defined (implicitly) by its training dataset, its behavior is likely to be unpredictable; such “blind spots” can be exposed through adversarial evaluation.

Use adversarial evaluation for fact checking: understand the limitations of the systems, and possibly use the adversarial instances to regularize the model through training-data augmentation.

FEVER 2.0: build it (create a system based on the original FEVER) —> break it (generate adversarial examples targeting the first stage’s systems) —> fix it (remedy the breaks)

“Breaking”: simple, rule-based transformations using the same evidence lowered the accuracy of the SOTA models by 11.32% ~ 29.16%.

Types of Adversarial Attacks

Manual Construction: humans create examples that exploit world knowledge, semantics, pragmatics, morphology, and syntactic variations.

Character-Level Perturbation: swap/insert letters

Distractor Information: concatenate short distractor sentences (entity substitution, generating a false answer via rule-based substitution on a true answer) onto the passage in SQuAD-style QA. The added information is about another entity and is therefore irrelevant to the answer, which should not change.

Paraphrasing: generate adversarial instances using alignments from parallel corpora from translation tasks with encoder-decoder models. 17.7~22.3% of adversarial examples were incorrect, and 14.0~19.3% were ungrammatical; phrase substitution was used to remedy this.

Programmatic Construction of Adversarial Dataset:

  1. Change meaning by fiddling with numerical reasoning
  2. Distractor phrases with strong negation that does not change the meaning
  3. Mimic typos

Automated Generation: use an autoencoder to generate natural language adversaries that read naturally. However, labels for the newly generated sentences might have to be readjusted.

Adversarial Attacks Against FEVER

FEVER score: % of instances where both the label and the evidence are correct
Accuracy: % of instances where the label matches (evidence ignored)
Potency of the adversarial instances: average error rate over all the predictions made by all systems, caused by a breaker b
Resilience of a system: FEVER score over all the accepted instances generated by all the breakers
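
A minimal sketch of these metrics, under the (assumed) simplification that a prediction “gets the evidence right” when one gold evidence set is fully contained in the retrieved evidence; the names and data structures are illustrative, not the official FEVER 2.0 scorer.

```python
# Illustrative sketch of the FEVER score and the potency/resilience metrics;
# the data structures are assumptions, not the official FEVER 2.0 scorer.
from dataclasses import dataclass

@dataclass
class Prediction:
    gold_label: str
    pred_label: str
    gold_evidence_sets: list[set[str]]   # each set is one complete evidence set
    retrieved_evidence: set[str]

def fever_correct(p: Prediction) -> bool:
    """Label must match and at least one full gold evidence set must be retrieved."""
    evidence_ok = any(es <= p.retrieved_evidence for es in p.gold_evidence_sets)
    return p.pred_label == p.gold_label and (evidence_ok or not p.gold_evidence_sets)

def fever_score(preds: list[Prediction]) -> float:
    return sum(fever_correct(p) for p in preds) / len(preds)

def potency(preds_by_system: dict[str, list[Prediction]]) -> float:
    """Average error rate a breaker induces across all evaluated systems."""
    all_preds = [p for preds in preds_by_system.values() for p in preds]
    return 1.0 - fever_score(all_preds)

def resilience(system_preds_on_all_attacks: list[Prediction]) -> float:
    """FEVER score of one system over the accepted instances of all breakers."""
    return fever_score(system_preds_on_all_attacks)
```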

3 types of transformations: entailment preserving, simple negation, complex negation
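
As a toy illustration of what such rule-based perturbations can look like, here is a sketch with one entailment-altering and one entailment-preserving rule; these patterns are my own examples, not the exact rules from the paper.

```python
# Toy rule-based claim perturbations; illustrative only, not the paper's rules.
import re

def simple_negation(claim: str) -> str:
    """Insert a negation after the first copula (entailment-altering)."""
    return re.sub(r"\b(is|was|are|were)\b", r"\1 not", claim, count=1)

def entailment_preserving(claim: str) -> str:
    """Prepend a vacuous hedge that should not change the label."""
    return "It is reported that " + claim

claim = "Shakira is a Colombian singer."
print(simple_negation(claim))        # -> "Shakira is not a Colombian singer."
print(entailment_preserving(claim))  # -> "It is reported that Shakira is a Colombian singer."
```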

Experimental Setup

Baseline: TF-IDF information retrieval with a decomposable attention model for NLI

Adversarial examples led to a stark decrease in model accuracy for NLI, but not for Info Retrieval that uses TF-IDF or keyword matching

Accuracy reduction: label-preserving transformations hurt more than label-altering ones, revealing the model’s inherent bias toward word overlap in antonymous examples.

jaehwanj6 commented 3 years ago

The Second Fact Extraction and VERification (FEVER2.0) Shared Task

source: https://aclanthology.org/D19-6601.pdf

Introduction

Adversarial Evaluation: test the model’s blind spots by introducing data outside of the training set’s distribution

Builders

DOMLIN: document retrieval module of Hanselowski et al. and a BERT model for two-staged sentence selection and NLI
CUNLP: document retrieval via Google searches & TF-IDF; a pointer network built on BERT features and trained with RL
GPLSI: for sentence selection, convert the claim and candidate evidence into OpenIE-style triples and calculate semantic similarity

Breakers

TMLab: generate adversarial claims using the Generative Enhanced Model, a modified and fine-tuned GPT-2. Annotators manually labeled the claims and added evidence; SUPPORTS claims were generated manually.

CUNLP: overcome the original FEVER’s shortcoming, the lack of multi-hop inference. Produce multi-hop reasoning claims by augmenting existing claims with conjunctions or relative clauses sourced from linked Wikipedia articles.

NbAuzDrLqg: for the retrieval attack, created claims that contain no entities usable as query terms. For the NLI attack, created attacks based on arithmetic operations, logical inconsistencies, and vague/hedged statements.

Fixers

CUNLP: improve multi-hop retrieval with an additional pointer network that uses the top 4 layers of a fine-tuned BERT Wikipedia title-to-document classifier as input features. Improve sentence selection by modeling the sequence of relations at each time step, training a network to predict a sequence of pointers to sentences in the evidence.

Analysis

Two attacks achieved a FEVER score of 0: the paraphrase attack from TMLab (re-writing sentences from Wikipedia articles in terms borrowed from different texts) and the SubsetNum attack from NbAuzDrLqg (requiring transitive reasoning w.r.t. the area and size of geographic regions); in both cases the evidence identification is wrong.

jaehwanj6 commented 3 years ago

FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information (0702)

source: https://arxiv.org/pdf/2106.05707.pdf

Abstract

Unstructured information: plain sentences. Structured information: tables.

Each claim in FEVEROUS has evidence in the form of sentences and/or cells from tables in Wikipedia. FEVEROUS is the first large-scale verification dataset that focuses on sentences, tables, and the combination of the two. FEVEROUS goal: (i) retrieve evidence (sentences / table cells) relevant to the claim, (ii) classify the claim 3 ways (Supported / Refuted / NEI).


Baseline (Intro)


Baseline FEVEROUS score on FEVEROUS: 18%; the retrieval module covers the evidence for 28% of the claims.

Claim Generation

FEVEROUS: generate claims from a “highlight” (4 consecutive sentences || a table) vs. FEVER: generate claims from a single sentence

  1. Claim using highlight only: use 4 sentences || table exclusively to generate a claim; paraphrase of a single sentence not allowed
  2. Claim beyond highlight: incorporate information beyond the scope of the highlight (same page or diff pages)
  3. Mutated Claim: modify Type 1 or Type 2 by [more specific, generalization, negation, paraphrasing, entity substitution]

Claim Verification

Each claim may have up to 3 partially overlapping evidence sets. If every entry in a table is needed, all entries are highlighted.

New question for annotators: “Would you consider yourself misled by the claim given the evidence you found?” (i.e., assuming the evidence is true, could the claim create a mistaken impression?)

Ex) Claim: “Shakira is Canadian”, Evidence: “Shakira is a Colombian singer, songwriter, dancer, and record producer”
FEVER —> NEI (she might be a dual citizen)
FEVEROUS —> REFUTED

Baseline Model

Retriever:

  1. (Document Retriever) Combination of entity matching & TF-IDF using DrQA to select the top k Wiki pages
  2. (Evidence Retriever) Separately score the top l sentences and q tables of the selected pages (k = 5, l = 5, q = 3 in the paper). For tables, linearize the table and treat cell retrieval as a binary sequence-labelling task with a fine-tuned RoBERTa model (input: claim concatenated with the linearized table); see the linearization sketch after this list.
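
A minimal sketch of what linearizing a table into a flat token sequence might look like; the exact cell/row markers and format used by the FEVEROUS baseline are assumptions here, the point is just that each cell becomes a tagged span in a string that can be concatenated with the claim.

```python
# Toy table linearization; the cell/row markers are illustrative assumptions,
# not the exact format used by the FEVEROUS baseline.
def linearize_table(header: list[str], rows: list[list[str]]) -> str:
    parts = ["[HEADER] " + " | ".join(header)]
    for i, row in enumerate(rows):
        cells = " | ".join(f"[CELL {i},{j}] {v}" for j, v in enumerate(row))
        parts.append(f"[ROW {i}] {cells}")
    return " ".join(parts)

claim = "Shakira released her first album in 1991."
table = linearize_table(
    header=["Year", "Album"],
    rows=[["1991", "Magia"], ["1993", "Peligro"]],
)
# The claim concatenated with the linearized table is the model input;
# each [CELL i,j] span would get a binary relevant/irrelevant label.
model_input = claim + " </s> " + table
print(model_input)
```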

Verdict Prediction:

Input: retrieved evidence — RoBERTa encoder with a linear layer —> verdict prediction

For tables, RoBERTa with the right linearization performs better than approaches that explicitly model the table structure.

Linearized table concatenated with the sentences —> enables cross-attention between cells and sentences.
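
A minimal sketch of the verdict-prediction step with a RoBERTa sequence classifier from HuggingFace Transformers; the checkpoint, label order, and evidence formatting are placeholders rather than the baseline’s actual configuration.

```python
# Sketch of 3-way verdict prediction with a RoBERTa classifier; the checkpoint,
# label order, and evidence formatting are assumptions, not the paper's setup.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(LABELS)
)  # freshly initialized head: would need fine-tuning on FEVEROUS to be useful

claim = "Shakira is Canadian."
evidence = "Shakira is a Colombian singer. [HEADER] Year | Album [ROW 0] ..."

inputs = tokenizer(claim, evidence, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(LABELS[logits.argmax(dim=-1).item()])
```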

FEVEROUS dataset lacks NEI labels (only 5%) —> create NEI instances by removing sentence || table from claims that require both

Experiments:

NEI is scarce —> rough label balance only for the test set. A prediction is correct iff the verdict is correct && the retrieved evidence is correct.

FEVEROUS score calculation:

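As I understand it, an instance counts as correct when the predicted verdict matches and at least one full gold evidence set is contained in the retrieved evidence; roughly (notation is mine, not the paper’s):

```latex
% Sketch of the FEVEROUS score as described above.
% \hat{y}_i, y_i : predicted / gold verdict for claim i
% \hat{E}_i      : retrieved evidence;  \mathcal{E}_i : the gold evidence sets
\mathrm{FEVEROUS\ score}
  = \frac{1}{n} \sum_{i=1}^{n}
    \mathbb{1}\!\left[\hat{y}_i = y_i \;\wedge\; \exists\, E \in \mathcal{E}_i : E \subseteq \hat{E}_i \right]
```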

Not every piece of evidence may have been labeled —> precision is not calculated for evidence retrieval (only recall).

Results:

FEVEROUS score:


Evidence Retrieval:

Verdict Prediction:

Discussion:

jaehwanj6 commented 3 years ago

GooAQ: Open Question Answering with Diverse Answer Types

source: https://arxiv.org/pdf/2104.08727.pdf

Abstract

GooAQ: QA dataset retrieved from Google

Questions: semi-automatically collected from the Google search engine using the autocomplete feature. Answers: Google’s responses to the respective questions.

Results after training a T5 model on GooAQ: 1) short-answer questions benefit from labeled data; 2) long answers rely on pre-trained knowledge.
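
For context, a minimal sketch of feeding a GooAQ-style question to a seq2seq T5 model with HuggingFace Transformers; the checkpoint and prompt format are placeholders, and a model actually fine-tuned on GooAQ would be needed for sensible answers.

```python
# Sketch of answer generation with T5; checkpoint and prompt format are
# placeholders, not the paper's trained GooAQ models.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "what is the boiling point of water at sea level?"
inputs = tokenizer("question: " + question, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```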

Disclaimer: Are we mimicking Google’s QA pipeline?

No! Google’s answer box service = AI-based QA system(s) + implicit user feedback (information contained in clicks / web link structures) + explicit user feedback + expert curation of answers to common questions.

Paper’s goal: capture Google’s QA in a ‘standard’ NLP QA system

Introduction

A lot of QA sets contain rather simply-structured questions like (Who ~ / When ~ / How Many ~ ) (aka Short Answer)

Everyday questions can be more complicated with diverse answer types, such as:

  1. (What is ~? Can you ~?) Question —> Answer: Multi-Sentence description (aka Snippet)
  2. (What are ~? / Things to ~? / How to ~?) Question —> Answer: List (aka Collection)
  3. Unit conversion / time zone conversion / etc Question —> Answer: ‘richer type’, ‘unique’

GooAQ has 3M questions covering the wide range of questions mentioned above. Automatically mined from Google’s search-autocomplete —> represent popular queries

Generative pre-trained language models tested in the paper are “self-contained reasoners”, i.e., no explicit access to external information.

Short Answer Q vs (Snippet Q & Collection Q) in terms of:

  1. Benefit from pretraining? Struggles with Short Answer Q vs performs surprisingly well at generating Snippet / Collection answers
  2. Benefit from labeled data? Notable benefit for Short Answer Q vs minimal gain for Snippet & Collection Q
  3. Benefit from a larger model? Performance boost of 5-10% for Short Answer Q vs 20+% for Snippet & Collection Q

Best score: although surprisingly high, still behind human gold responses

Related Work

Natural-Questions (NQ) dataset vs GooAQ

ELI5 dataset vs GooAQ


GooAQ Dataset Construction

1) Query Extraction (from search auto-complete)

  1. Seed set of question terms (‘who’, ‘where’, ‘what’, ‘would’, ‘must’ etc. total 33 terms)
  2. bootstrap by repeatedly querying autocomplete with prefixes of previously collected questions (see the sketch after this list)
    • Filter out Q < 5 tokens
    • Collected ~5M Q, with an average length of 8 tokens
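
A minimal sketch of what this prefix-based bootstrap might look like; `fetch_autocomplete_suggestions` is a hypothetical stand-in (here returning canned data) for the actual search-autocomplete endpoint and filtering used by the authors.

```python
# Sketch of autocomplete-based question bootstrapping; fetch_autocomplete_suggestions
# is a hypothetical placeholder for querying the search-autocomplete service.
from collections import deque

def fetch_autocomplete_suggestions(prefix: str) -> list[str]:
    """Hypothetical stand-in: return canned autocomplete completions for a prefix."""
    canned = {
        "who": ["who is the president of colombia", "who wrote the odyssey"],
        "what": ["what is the boiling point of water at sea level"],
    }
    return canned.get(prefix, [])

SEED_TERMS = ["who", "where", "what", "when", "how", "would", "must"]  # subset of the 33

def bootstrap_questions(max_questions: int = 1000) -> set[str]:
    questions: set[str] = set()
    frontier = deque(SEED_TERMS)
    while frontier and len(questions) < max_questions:
        prefix = frontier.popleft()
        for q in fetch_autocomplete_suggestions(prefix):
            if len(q.split()) < 5 or q in questions:   # drop short / duplicate Qs
                continue
            questions.add(q)
            # re-query with a prefix of the new question to discover more questions
            frontier.append(" ".join(q.split()[:3]))
    return questions

print(bootstrap_questions())
```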

2) Answer Extraction (from answer boxes)

Data statistics


Experiment

Result


Error analysis

jaehwanj6 commented 3 years ago

Fact Verification from Information-seeking Questions

source: https://arxiv.org/abs/2107.02153

Abstract

Introduction

Data


Data Sources

Composing Valid and Invalid QA Pairs

FAVIQ: Ambiguous Dataset (A) + Regular Dataset (R)

Ambiguous Questions (A)

Regular Questions (R)

QA —> Claim

Data Analysis


Experiment

Result on FEVER:

Result on FAVIQ

Training DPR

Result


Professional Fact Checking Experiment

jaehwanj6 commented 3 years ago

Adversarial NLI: A New Benchmark for Natural Language Understanding

Abstract

Introduction

Dataset Collection

HAMLET (Human-And-Model-in-the-Loop Enabled Training) to create ANLI


If the model predicts the target label for (premise, hypothesis) —> add to train. Elif human verification confirms (premise, hypothesis, target label) —> add to train, dev, test.
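
A minimal sketch of that loop in Python; `model_predict`, `human_verify`, and `split_assigner` are hypothetical placeholders, and the routing follows my reading of these notes rather than the paper’s full procedure.

```python
# Sketch of the HAMLET data-collection loop; model_predict, human_verify, and
# split_assigner are hypothetical placeholders, not the paper's implementation.
def hamlet_round(examples, model_predict, human_verify, split_assigner):
    train, dev, test = [], [], []
    for premise, hypothesis, target_label in examples:
        if model_predict(premise, hypothesis) == target_label:
            # Model was not fooled: the example only goes to training data.
            train.append((premise, hypothesis, target_label))
        elif human_verify(premise, hypothesis, target_label):
            # Model was fooled and the label is verified: eligible for dev/test too.
            split = split_assigner(premise)  # e.g. assign train/dev/test by premise
            {"train": train, "dev": dev, "test": test}[split].append(
                (premise, hypothesis, target_label)
            )
    return train, dev, test
```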

Annotation details

Round 1

Round 2

Round 3

Comparing with other datasets

Dataset Statistics


Results


Stress Test Results


Hypothesis-only results


Linguistic Analysis
