source: http://www.derczynski.com/papers/danfever.pdf
First non-English dataset for the FEVER task (*manually created by 4 people)
EnFEVER: English FEVER; DanFEVER: Danish FEVER
Can be used for multilingual fact verification systems (?)
DanFEVER claims generated under the same guideline as EnFEVER
6,407 claims made, each classified as Supported / Refuted / NotEnoughInfo
Evidence for NEI was randomly sampled
Data for claims: Danish Wikipedia (80%) & Den Store Danske (20%, an online Danish encyclopedia) —> the Wikipedia data used for claim generation is openly available, but the Den Store Danske data is not
Two teams of two people independently sampled evidence; Fleiss' kappa was 0.75 and 0.82 for the two teams when measured on a reduced subset
Strategies for gathering source texts for claims: pages with well-known topics were selected from Wikipedia's starred articles and Den Store Danske; Wikipedia entities with abstracts were sampled at random to ensure a broad spectrum of topics
Common 3-step strategy for FEVER: Document Retrieval, Sentence Retrieval, Recognizing Textual Entailment (RTE). DanFEVER provides a baseline for RTE
Although the DanFEVER dataset is small, models trained on DanFEVER show performance comparable to models trained on EnFEVER (F1 = 73% for an MLP, F1 = 88% for a Decomposable Attention network)
*Does not provide baseline result for the entire pipeline
source: https://arxiv.org/abs/1903.05543
A simple method of generating entailment-preserving and entailment-altering perturbations of instances by common patterns within the training data.
When the model is evaluated on data outside of the distribution defined (implicitly) by its training dataset, its behavior is likely to be unpredictable; such “blind spots” can be exposed through adversarial evaluation.
Use adversarial evaluation for fact checking. adversarial evaluation —> understand the limitations of the systems and possibly use these adversarial instances to regularize the model through training data augmentation.
FEVER 2.0: build it (create a system based on the original FEVER) —> break it (generate adversarial examples targeting the first stage’s systems) —> fix it (remedy the breaks)
“Breaking”: simple, rule-based transformations using the same evidence —> lowered the accuracy of the SOTA models by 11.32% ~ 29.16%
Manual Construction: humans create examples that exploit world knowledge, semantics, pragmatics, morphology, and syntactic variations.
Character-Level Perturbation: swap/insert letters
Distractor Information: concatenate short distractor sentences (entity substitution, or generating a false answer via rule-based substitution on a true answer) to the passage in SQuAD-style QA. The added info is about another entity and therefore irrelevant, so the correct answer should not change.
Paraphrasing: generate adversarial instances using alignments from parallel corpora from translation tasks with encoder-decoder models. 17.7~22.3% of adversarial examples were incorrect, and 14.0~19.3% were ungrammatical. Phrase substitution can remedy this.
Programmatic Construction of Adversarial Dataset:
Automated Generation: use an autoencoder to generate natural-looking adversarial examples. However, labels for the newly generated sentences might have to be readjusted.
FEVER score: % where both label & evidence are correct
Accuracy: % where the label matches (ignore evidence)
Potency of adversarial instances: average error rate, over all predictions made by all systems, caused by a breaker b
Resilience of a system: FEVER score over all the accepted instances generated by all the breakers
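A minimal sketch of how these metrics relate, assuming simple dicts/lists for predictions and gold data (the field names and formats are illustrative assumptions, not the shared task's actual file format):

```python
# Toy versions of the FEVER 2.0 metrics summarized above.

def instance_correct(pred, gold):
    """FEVER-score criterion: label must match AND, for Supported/Refuted,
    at least one full gold evidence set must be retrieved."""
    if pred["label"] != gold["label"]:
        return False
    if gold["label"] == "NOT ENOUGH INFO":
        return True
    retrieved = set(pred["evidence"])
    return any(set(ev_set) <= retrieved for ev_set in gold["evidence_sets"])

def fever_score(preds, golds):
    return sum(instance_correct(p, g) for p, g in zip(preds, golds)) / len(golds)

def label_accuracy(preds, golds):
    return sum(p["label"] == g["label"] for p, g in zip(preds, golds)) / len(golds)

def potency(per_system_preds, breaker_golds):
    """Mean error rate (1 - FEVER score) a breaker's accepted instances
    induce across all evaluated systems."""
    return sum(1.0 - fever_score(preds, breaker_golds)
               for preds in per_system_preds) / len(per_system_preds)

def resilience(system_preds, all_accepted_adversarial_golds):
    """A system's FEVER score over all accepted adversarial instances."""
    return fever_score(system_preds, all_accepted_adversarial_golds)
```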
3 types of transformations: entailment preserving, simple negation, complex negation
Baseline: TF-IDF information retrieval with a decomposable attention model for NLI
Adversarial examples led to a stark decrease in model accuracy for NLI, but not for Info Retrieval that uses TF-IDF or keyword matching
Accuracy reduction: label-preserving transformations > label-altering transformations
This reveals the model’s inherent bias, i.e. its dependence on word overlap, in antonymous examples
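A toy sketch of the three transformation types listed above (entailment preserving, simple negation, complex negation). The prefix, patterns, and antonym list are made up for illustration; the paper's rules are richer:

```python
import re

# Assumed antonym pairs for illustration only.
ANTONYMS = {"won": "lost", "before": "after", "north": "south"}

def entailment_preserving(claim: str) -> str:
    # Toy rewrite intended to keep the label unchanged.
    return "In fact, " + claim[0].lower() + claim[1:]

def simple_negation(claim: str) -> str:
    # Naive negation: "X is Y" -> "X is not Y".
    return re.sub(r"\b(is|was|are|were)\b", r"\1 not", claim, count=1)

def complex_negation(claim: str) -> str:
    # Swap a word for an antonym, altering the label while keeping word overlap high.
    for word, antonym in ANTONYMS.items():
        if re.search(rf"\b{word}\b", claim):
            return re.sub(rf"\b{word}\b", antonym, claim, count=1)
    return simple_negation(claim)

print(simple_negation("Copenhagen is the capital of Denmark."))
# -> "Copenhagen is not the capital of Denmark."
```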
source: https://aclanthology.org/D19-6601.pdf
Adversarial Evaluation: test the model’s blindspot by introducing data outside of the train set’s distribution
DOMLIN: document retrieval module of Hanselowski et al. and a BERT model for two-staged sentence selection and NLI
CUNLP: document retrieval via Google searches & TF-IDF, a pointer network built on BERT features and trained with RL
GPLSI: for sentence selection, convert the claim and candidate evidence into OpenIE-style triples and calculate semantic similarity
TMLab: generate adversarial claims using the Generative Enhanced Model, a modified and fine-tuned GPT-2. Annotators manually labeled claims and added evidence; SUPPORTS claims were generated manually
CUNLP: overcome the original FEVER’s shortcoming: lack of multi-hop inference. Produce multi-hop reasoning claims by augmenting existing claims with conjunctions or relative clauses sourced from linked Wikipedia articles.
NbAuzDrLqg: for the retrieval attack, created claims not containing entities that can be used as query terms
For the NLI attack, created attacks based on arithmetic operations, logical inconsistencies, and vague/hedged statements
CUNLP: improve multi-hop retrieval —> additional pointer network with the top 4 layers of a fine-tuned BERT Wikipedia title-to-document classifier as input features
Improve sentence selection —> model the sequence of relations at each time step by training a network to predict a sequence of pointers to sentences in the evidence
2 attacks with a FEVER score of 0: the paraphrase attack from TMLab (rewrite sentences from Wikipedia articles in terms borrowed from different texts) and the SubsetNum attack (requires transitive reasoning w.r.t. the area and size of geographic regions) from NbAuzDrLqg. (evidence identification fails in both cases)
source: https://arxiv.org/pdf/2106.05707.pdf
Unstructured information: plain sentences; structured information: tables
Each claim in FEVEROUS has evidence in the form of sentences and/or cells from tables in Wikipedia
FEVEROUS is the first large-scale verification dataset that covers sentences, tables, and the combination of the two
FEVEROUS goal: i) retrieve evidence (sentences / table cells) relevant to the claim, ii) classify the claim 3 ways (supported/refuted/NEI)
Sentence Retrieval: combination of entity matching & TF-IDF to extract the most relevant sentences
Table Evidence Retriever: linearize the given table & extract relevant cells as a binary sequence labelling task
Verdict predictor: A RoBERTa classifier pre-trained on multiple NLI datasets
Baseline FEVEROUS score: 18%; the retrieval module fully covers the evidence for 28% of the claims
FEVEROUS: Generate claim from “highlight” ( 4 consecutive sentences || table) VS FEVER: Generate claim from 1 sentence
Each claim may have up to 3 partially overlapping evidence sets
If every entry in a table is needed, the whole table is highlighted
New question for annotators: “Would you consider yourself misled by the claim given the evidence you found?” (i.e., assuming the evidence is true, could the claim create a misleading impression?)
Ex) Claim: “Shakira is Canadian”, Evidence: “Shakira is a Colombian singer, songwriter, dancer, and record producer”
FEVER —> NEI (she might be a dual citizen)
FEVEROUS —> REFUTED
Retriever:
Verdict Prediction:
Input: retrieved evidence — RoBERTa encoder with a linear layer —> verdict prediction
For tables, RoBERTa performs better with the right linearization than with models that take table structure into account.
linearized table + (concatenate) + sentences —> enables cross-attention between cells
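A minimal sketch of the linearize-and-concatenate idea: flatten a table row by row, keep cell ids so selected cells can be mapped back, then feed the linearized table together with sentence evidence as one encoder input. The [CELL]/[ROW] markers, separator token, and id format are assumptions, not FEVEROUS's exact scheme:

```python
def linearize_table(table, table_id=0):
    """table: list of rows, each row a list of cell strings."""
    pieces, cell_ids = [], []
    for i, row in enumerate(table):
        for j, cell in enumerate(row):
            cell_ids.append(f"cell_{table_id}_{i}_{j}")
            pieces.append(f"[CELL] {cell}")
        pieces.append("[ROW]")  # mark end of each row
    return " ".join(pieces), cell_ids

def build_input(claim, sentences, table):
    lin_table, cell_ids = linearize_table(table)
    evidence = " ".join(sentences) + " " + lin_table
    # One flat sequence -> full cross-attention between sentences and cells.
    return f"{claim} </s> {evidence}", cell_ids

text, ids = build_input(
    "Shakira is Canadian",
    ["Shakira is a Colombian singer."],
    [["Name", "Nationality"], ["Shakira", "Colombian"]],
)
```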
FEVEROUS dataset lacks NEI labels (only 5%) —> create NEI instances by removing sentence || table from claims that require both
NEI scarce —> rough label balance only for the test set
A prediction is correct iff the verdict is correct && the retrieved evidence is correct
FEVEROUS score calculation:
Not every piece of evidence may have been labeled —> precision is not calculated for evidence retrieval (only recall matters)
FEVEROUS score: an instance counts as correct only if both evidence retrieval (a full gold evidence set is retrieved) and verdict prediction are correct
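A sketch of that scoring rule, assuming string ids for sentence and cell evidence (the id format and field names are illustrative):

```python
def feverous_correct(pred_label, retrieved_evidence, gold_label, gold_evidence_sets):
    """Correct iff the verdict matches AND at least one full gold evidence set
    (sentences and/or cells) is contained in the retrieved evidence."""
    if pred_label != gold_label:
        return False
    retrieved = set(retrieved_evidence)          # e.g. {"sent_12", "cell_0_3_1"}
    return any(set(es) <= retrieved for es in gold_evidence_sets)

def feverous_score(predictions, golds):
    hits = sum(
        feverous_correct(p["label"], p["evidence"], g["label"], g["evidence_sets"])
        for p, g in zip(predictions, golds)
    )
    return hits / len(golds)
```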
Retrieval of structured information: status quo: retrieve sentence evidence and table evidence separately. BUT, sentences around table might have valuable contextual info <— incorporate how??
Numerical Reasoning: current models show low performance on numerical reasoning, even simple arithmetic operations; 12% of FEVEROUS data require numerical reasoning
Verification of complex claims: FEVEROUS >> FEVER in len(claim) and # evidence required —> more evidence per claim, and the evidence pieces are related to each other
FEVEROUS offers an opportunity to look into how different parts of a claim are supported by evidence
Beyond Linearization: other ways to analyze the table given cell-level annotations? ex) joint training of cell selection & verdict prediction using a Graph Neural Network
source: https://arxiv.org/pdf/2104.08727.pdf
GooAQ: QA dataset retrieved from Google
Questions: semi-automatically collected from the Google search engine using the autocomplete feature
Answers: Google’s responses to the respective questions
Result after training a T5 model on GooAQ: 1) short-answer questions rely on the labeled data 2) long answers rely on pre-trained knowledge
Disclaimer: Are we mimicking Google’s QA pipeline?
No! Google’s answer box service = AI-based QA system(s) + implicit user feedback (info contained in clicks / web link structure) + explicit user feedback + expert curation of answers to common questions.
Paper’s goal: capture Google’s QA in a ‘standard’ NLP QA system
A lot of QA datasets contain rather simply-structured questions like (Who ~ / When ~ / How many ~) (aka short answers)
Everyday questions can be more complicated, with diverse answer types such as snippet (short paragraph) answers and collection (list) answers
GooAQ has 3M questions covering the wide range of question types mentioned above. Automatically mined from Google’s search autocomplete —> represents popular queries
Generative pre-trained language models tested in the paper are “self-contained reasoners”, i.e. no explicit access to external info
Short Answer Q vs (Snippet Q & Collection Q) in terms of:
Best score: although surprisingly high, still behind human gold responses
Setup: open QA. Input: question —> output: answer. Context for the answer is not given; the model must use background knowledge
Split into 3 subtasks (short, collection, snippet) —> since the model cannot infer the question type yet
Data Split Problem) knowledge leakage from training data
Solution) collect the most dissimilar sentences for eval / test (similarity measured by max token-overlap with the train data)
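A sketch of that leakage-aware split: keep an eval/test question only if its maximum token overlap with any training question is low. The overlap measure, threshold, and whitespace tokenization are assumptions, not the paper's exact procedure:

```python
def token_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))

def max_overlap_with_train(question: str, train_questions) -> float:
    return max((token_overlap(question, q) for q in train_questions), default=0.0)

def pick_dissimilar(candidates, train_questions, threshold=0.5):
    """Keep the candidates least similar to the training data."""
    return [q for q in candidates
            if max_overlap_with_train(q, train_questions) < threshold]
```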
Model T5-small and T5-11B
Evaluation: automatic —> ROUGE-L; human —> Amazon Mechanical Turk workers
Metric: fraction of cases where humans preferred the model-generated response over the gold answer; 50% = human parity
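A compact, LCS-based ROUGE-L F1 sketch for reference; real evaluations would typically use an established implementation rather than this toy version:

```python
def _lcs_len(a, b):
    # Standard dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = _lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```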
source: https://arxiv.org/abs/2107.02153
Information-seeking Questions may contain ambiguity
Disambiguation: Since the original Q is unclear, we create multiple potential (q, a) pairs
Crossover of Disambiguation: given (q1, a1) and (q2, a2) —> (q1, a2) and (q2, a1) would be invalid
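A sketch of the crossover idea: a question paired with its own answer yields a supported claim, while pairing it with the answer to a different disambiguation of the same ambiguous question yields an invalid (refuted) claim. The claim template and labels are an assumed simplification of FaVIQ's QA-to-claim conversion:

```python
def qa_to_claim(question: str, answer: str) -> str:
    # Toy template; the actual conversion rewrites the question as a statement.
    return f"The answer to '{question}' is {answer}."

def crossover(qa1, qa2):
    (q1, a1), (q2, a2) = qa1, qa2
    return [
        (qa_to_claim(q1, a1), "SUPPORTED"),
        (qa_to_claim(q2, a2), "SUPPORTED"),
        (qa_to_claim(q1, a2), "REFUTED"),   # answers swapped across disambiguations
        (qa_to_claim(q2, a1), "REFUTED"),
    ]
```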
Lower lexical bias than crowdsourced claims (a model with no background knowledge performs similarly to random guessing)
FAVIQ: Ambiguous Dataset (A) + Regular Dataset (R)
Local Mutual Information: measure of correlation between bigram and label (measure lexical bias)
FAVIQ has notably lower LMI than FEVER (less biased)
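A sketch of the LMI computation over (bigram, label) pairs, following the standard definition LMI(w, l) = p(w, l) · log(p(l | w) / p(l)); input format is assumed:

```python
import math
from collections import Counter

def lmi(examples):
    """examples: iterable of (ngram, label) pairs; returns {(ngram, label): LMI}."""
    joint, ngram_counts, label_counts = Counter(), Counter(), Counter()
    for ngram, label in examples:
        joint[(ngram, label)] += 1
        ngram_counts[ngram] += 1
        label_counts[label] += 1
    n = sum(joint.values())
    scores = {}
    for (ngram, label), c in joint.items():
        p_joint = c / n
        p_label_given_ngram = c / ngram_counts[ngram]
        p_label = label_counts[label] / n
        scores[(ngram, label)] = p_joint * math.log(p_label_given_ngram / p_label)
    return scores
```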
SNOPES: 6,422 claims annotated by professional fact-checkers, gold evidence given
SCIFACT: 1,109 claims based on scientific papers, annotated by domain experts, gold evidence not given (use TF-IDF at the sentence level)
Model: BART
NLU benchmarks become quickly obsolete (especially after the arrival of BERT)
Many SOTA models exploit spurious statistical patterns in datasets
Proposal: adversarial human-and-model-in-the-loop —> NLU dataset
If target label == Model(premise, hypothesis) —> add to train
Elif Verification(premise, hypothesis, target label) —> add to train, dev, test
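A sketch of that routing rule for the human-and-model-in-the-loop collection; `model` and `verify` are placeholders for the real classifier and the human verification step:

```python
def route_example(premise, hypothesis, target_label, model, verify):
    # Model not fooled: the example still goes to the training set.
    if model(premise, hypothesis) == target_label:
        return ["train"]
    # Model fooled: keep for train/dev/test only if human verifiers confirm the label.
    if verify(premise, hypothesis, target_label):
        return ["train", "dev", "test"]
    return []  # discarded
```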
FEVER: a large-scale dataset for Fact Extraction and Verification
source: https://arxiv.org/abs/1803.05355
Abstract
Each claim —classify—> (Supported, Refuted, Not enough info)
For Supported and Refuted, evidence is also provided
Introduction
For Textual Entailment / NLI, the paragraph containing the evidence is explicitly provided. On the other hand, Textual Claim Verification searches through a vast corpus to identify the evidence.
For Question Answering, the question contains information needed to identify the answer, but information missing from a claim can often be crucial in retrieving refuting evidence.
Pipeline:
Claim Generation
Processed the 2017/06 Wikipedia dump with Stanford CoreNLP
Sampled sentences from the introductory sections of 50,000 popular pages
Randomly choose a sentence —> make annotators generate a set of claims about the page’s subject
Tradeoff: simple generation —> paraphrases, too easy vs. using world knowledge —> too hard
Solution: a “dictionary” providing the list of terms hyperlinked in the original sentence & the 1st sentence of their respective Wikipedia entries
Mutations were also generated: paraphrasing, negation, substitution of entity with a similar/dissimilar one, make claim more general/specific
Annotators tended to produce only trivial negation mutations (just adding “not”); the authors tried to discourage this behavior
Claim Labeling
(Supported, Refuted, Not Enough Info (the claim was too general/specific to be supported or refuted))
Questions for annotators:
Annotators were allowed to add other urls to support claims. Annotators were encouraged to keep time under 2 min per claim
Data Validation
Validation of claim-generation: implicitly done in claim-labeling process. 3 ways to validate the claim-labeling process:
Baseline System Description
Pipeline: document retrieval —> sentence selection —> recognizing textual entailment
NotEnoughInfo has no corresponding evidence —> nothing to feed the RTE input
Solution: sample a sentence from the nearest page / sample a sentence at random
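A sketch of the three-stage baseline described above, with the NEI sentence-sampling workaround. `doc_retriever`, `sent_selector`, and `rte_model` are placeholders (e.g. a DrQA-style TF-IDF retriever and an entailment classifier), not the paper's exact components:

```python
import random

def verify(claim, doc_retriever, sent_selector, rte_model, k_docs=5, k_sents=5):
    docs = doc_retriever(claim, k=k_docs)            # document retrieval (TF-IDF)
    evidence = sent_selector(claim, docs, k=k_sents) # sentence selection (TF-IDF)
    label = rte_model(claim, evidence)               # Supported / Refuted / NEI
    return label, evidence

def nei_training_evidence(nearest_page_sentences):
    # NEI claims have no gold evidence, so sample a sentence from the nearest
    # retrieved page (or at random) to complete the RTE training input.
    return random.choice(nearest_page_sentences)
```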
Best performance: 31.87% for the verification & evidence detection task; 50.91% for verification only
The most challenging part is evidence identification