source: http://www.derczynski.com/papers/danfever.pdf
First non-English dataset for the FEVER task (*manually created by 4 people)
EnFEVER: English FEVER; DanFEVER: Danish FEVER
Can be used for multilingual fact verification systems (?)
DanFEVER claims generated under the same guideline as EnFEVER
6,407 claims made, each classified as Supported / Refuted / NotEnoughInfo
Evidence for NEI was randomly sampled
Data for claims: Danish Wikipedia (80%) & Den Store Danske (20%, an online Danish encyclopedia) —> the Wikipedia data used for claim generation is openly available, but the Den Store Danske data is not
Two teams of two people independently sampled evidence; Fleiss' kappa was 0.75 and 0.82 for the two teams when measured on a reduced subset
Strategies for gathering source texts for claims: pages with well-known topics were selected from Wikipedia's starred articles and Den Store Danske; Wikipedia entities with abstracts were sampled at random to ensure a broad spectrum of topics
Common 3-step strategy for FEVER: Document Retrieval, Sentence Retrieval, Recognizing Textual Entailment (RTE). DanFEVER provides a baseline for RTE
Although the DanFEVER dataset is small, models trained on DanFEVER show performance comparable to models trained on EnFEVER (F1 = 73% for an MLP, F1 = 88% for a Decomposable Attention network)
*Does not provide baseline result for the entire pipeline
source: https://arxiv.org/abs/1903.05543
A simple method of generating entailment-preserving and entailment-altering perturbations of instances by common patterns within the training data.
When the model is evaluated on data outside of the distribution defined (implicitly) by its training dataset, its behavior is likely to be unpredictable; such “blind spots” can be exposed through adversarial evaluation.
Use adversarial evaluation for fact checking. adversarial evaluation —> understand the limitations of the systems and possibly use these adversarial instances to regularize the model through training data augmentation.
FEVER 2.0: build it (create a system based on the original FEVER) —> break it (generate adversarial examples targeting the first stage’s systems) —> fix it (remedy the breaks)
“Breaking”: simple, rule-based transformations using the same evidence —> lowered the accuracy of the SOTA models by 11.32% ~ 29.16%
Manual Construction: humans create examples that exploit world knowledge, semantics, pragmatics, morphology, and syntactic variations.
Character-Level Perturbation: swap/insert letters
Distractor Information: concatenate short distractor sentences (entity substitution, or generating a false answer via rule-based substitution on a true answer) to the passage in SQuAD-style QA. The added info is about another entity and therefore irrelevant, so the correct answer should not change.
Paraphrasing: generate adversarial instances using alignments from parallel corpora from translation tasks with encoder-decoder models. 17.7~22.3% of adversarial examples were incorrect, and 14.0~19.3% were ungrammatical. Phrase substitution can remedy this.
Programmatic Construction of Adversarial Dataset:
Automated Generation: use an autoencoder to generate natural-looking adversarial examples. However, labels for the newly generated sentences might have to be readjusted.
FEVER score: % where both label & evidence are correct
Accuracy: % where the label matches (ignore evidence)
Potency of adversarial instances: average error rate, over all predictions made by all systems, caused by a breaker b
Resilience of a system: FEVER score over all the accepted instances generated by all the breakers
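A minimal sketch of how these metrics relate, assuming simple dicts/lists for predictions and gold data (the field names and formats are illustrative assumptions, not the shared task's actual file format):

```python
# Toy versions of the FEVER 2.0 metrics summarized above.

def instance_correct(pred, gold):
    """FEVER-score criterion: label must match AND, for Supported/Refuted,
    at least one full gold evidence set must be retrieved."""
    if pred["label"] != gold["label"]:
        return False
    if gold["label"] == "NOT ENOUGH INFO":
        return True
    retrieved = set(pred["evidence"])
    return any(set(ev_set) <= retrieved for ev_set in gold["evidence_sets"])

def fever_score(preds, golds):
    return sum(instance_correct(p, g) for p, g in zip(preds, golds)) / len(golds)

def label_accuracy(preds, golds):
    return sum(p["label"] == g["label"] for p, g in zip(preds, golds)) / len(golds)

def potency(per_system_preds, breaker_golds):
    """Mean error rate (1 - FEVER score) a breaker's accepted instances
    induce across all evaluated systems."""
    return sum(1.0 - fever_score(preds, breaker_golds)
               for preds in per_system_preds) / len(per_system_preds)

def resilience(system_preds, all_accepted_adversarial_golds):
    """A system's FEVER score over all accepted adversarial instances."""
    return fever_score(system_preds, all_accepted_adversarial_golds)
```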
3 types of transformations: entailment preserving, simple negation, complex negation
Baseline: TF-IDF information retrieval with a decomposable attention model for NLI
Adversarial examples led to a stark decrease in model accuracy for NLI, but not for Info Retrieval that uses TF-IDF or keyword matching
Accuracy reduction: label-preserving transformations > label-altering transformations
This reveals the model’s inherent bias, i.e. its dependence on word overlap, in antonymous examples
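A toy sketch of the three transformation types listed above (entailment preserving, simple negation, complex negation). The prefix, patterns, and antonym list are made up for illustration; the paper's rules are richer:

```python
import re

# Assumed antonym pairs for illustration only.
ANTONYMS = {"won": "lost", "before": "after", "north": "south"}

def entailment_preserving(claim: str) -> str:
    # Toy rewrite intended to keep the label unchanged.
    return "In fact, " + claim[0].lower() + claim[1:]

def simple_negation(claim: str) -> str:
    # Naive negation: "X is Y" -> "X is not Y".
    return re.sub(r"\b(is|was|are|were)\b", r"\1 not", claim, count=1)

def complex_negation(claim: str) -> str:
    # Swap a word for an antonym, altering the label while keeping word overlap high.
    for word, antonym in ANTONYMS.items():
        if re.search(rf"\b{word}\b", claim):
            return re.sub(rf"\b{word}\b", antonym, claim, count=1)
    return simple_negation(claim)

print(simple_negation("Copenhagen is the capital of Denmark."))
# -> "Copenhagen is not the capital of Denmark."
```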
source: https://aclanthology.org/D19-6601.pdf
Adversarial Evaluation: test the model’s blindspot by introducing data outside of the train set’s distribution
DOMLIN: document retrieval module of Hanselowski et al. and a BERT model for two-staged sentence selection and NLI
CUNLP: document retrieval via Google searches & TF-IDF, a pointer network built on BERT features and trained with RL
GPLSI: for sentence selection, convert the claim and candidate evidence into OpenIE-style triples and calculate semantic similarity
TMLab: generate adversarial claims using the Generative Enhanced Model, a modified and fine-tuned GPT-2. Annotators manually labeled claims and added evidence; SUPPORTS claims were generated manually
CUNLP: overcome the original FEVER’s shortcoming: lack of multi-hop inference. Produce multi-hop reasoning claims by augmenting existing claims with conjunctions or relative clauses sourced from linked Wikipedia articles.
NbAuzDrLqg: for the retrieval attack, created claims not containing entities that can be used as query terms
For the NLI attack, created attacks based on arithmetic operations, logical inconsistencies, and vague/hedged statements
CUNLP: improve multi-hop retrieval —> additional pointer network with the top 4 layers of a fine-tuned BERT Wikipedia title-to-document classifier as input features
Improve sentence selection —> model the sequence of relations at each time step by training a network to predict a sequence of pointers to sentences in the evidence
2 attacks with a FEVER score of 0: the paraphrase attack from TMLab (rewrite sentences from Wikipedia articles in terms borrowed from different texts) and the SubsetNum attack (requires transitive reasoning w.r.t. the area and size of geographic regions) from NbAuzDrLqg. (evidence identification fails in both cases)
source: https://arxiv.org/pdf/2106.05707.pdf
Unstructured information: plain sentences; structured information: tables
Each claim in FEVEROUS has evidence in the form of sentences and/or cells from tables in Wikipedia
FEVEROUS is the first large-scale verification dataset that covers sentences, tables, and the combination of the two
FEVEROUS goal: i) retrieve evidence (sentences / table cells) relevant to the claim, ii) classify the claim 3 ways (supported/refuted/NEI)
Sentence Retrieval: combination of entity matching & TF-IDF to extract the most relevant sentences
Table Evidence Retriever: linearize the given table & extract relevant cells as a binary sequence labelling task
Verdict predictor: A RoBERTa classifier pre-trained on multiple NLI datasets
Baseline FEVEROUS score: 18%; the retrieval module fully covers the evidence for 28% of the claims
FEVEROUS: Generate claim from “highlight” ( 4 consecutive sentences || table) VS FEVER: Generate claim from 1 sentence
Each claim may have up to 3 partially overlapping evidence sets
If every entry in a table is needed, the whole table is highlighted
New question for annotators: “Would you consider yourself misled by the claim given the evidence you found?” (i.e., assuming the evidence is true, could the claim create a misleading impression?)
Ex) Claim: “Shakira is Canadian”, Evidence: “Shakira is a Colombian singer, songwriter, dancer, and record producer”
FEVER —> NEI (she might be a dual citizen)
FEVEROUS —> REFUTED
Retriever:
Verdict Prediction:
Input: retrieved evidence — RoBERTa encoder with a linear layer —> verdict prediction
For tables, RoBERTa performs better with the right linearization than with models that take table structure into account.
linearized table + (concatenate) + sentences —> enables cross-attention between cells
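A minimal sketch of the linearize-and-concatenate idea: flatten a table row by row, keep cell ids so selected cells can be mapped back, then feed the linearized table together with sentence evidence as one encoder input. The [CELL]/[ROW] markers, separator token, and id format are assumptions, not FEVEROUS's exact scheme:

```python
def linearize_table(table, table_id=0):
    """table: list of rows, each row a list of cell strings."""
    pieces, cell_ids = [], []
    for i, row in enumerate(table):
        for j, cell in enumerate(row):
            cell_ids.append(f"cell_{table_id}_{i}_{j}")
            pieces.append(f"[CELL] {cell}")
        pieces.append("[ROW]")  # mark end of each row
    return " ".join(pieces), cell_ids

def build_input(claim, sentences, table):
    lin_table, cell_ids = linearize_table(table)
    evidence = " ".join(sentences) + " " + lin_table
    # One flat sequence -> full cross-attention between sentences and cells.
    return f"{claim} </s> {evidence}", cell_ids

text, ids = build_input(
    "Shakira is Canadian",
    ["Shakira is a Colombian singer."],
    [["Name", "Nationality"], ["Shakira", "Colombian"]],
)
```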
FEVEROUS dataset lacks NEI labels (only 5%) —> create NEI instances by removing sentence || table from claims that require both
NEI scarce —> rough label balance only for the test set
A prediction is correct iff the verdict is correct && the retrieved evidence is correct
FEVEROUS score calculation:
Not every piece of evidence may have been labeled —> precision is not calculated for evidence retrieval (only recall matters)
FEVEROUS score: an instance counts as correct only if both evidence retrieval (a full gold evidence set is retrieved) and verdict prediction are correct
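A sketch of that scoring rule, assuming string ids for sentence and cell evidence (the id format and field names are illustrative):

```python
def feverous_correct(pred_label, retrieved_evidence, gold_label, gold_evidence_sets):
    """Correct iff the verdict matches AND at least one full gold evidence set
    (sentences and/or cells) is contained in the retrieved evidence."""
    if pred_label != gold_label:
        return False
    retrieved = set(retrieved_evidence)          # e.g. {"sent_12", "cell_0_3_1"}
    return any(set(es) <= retrieved for es in gold_evidence_sets)

def feverous_score(predictions, golds):
    hits = sum(
        feverous_correct(p["label"], p["evidence"], g["label"], g["evidence_sets"])
        for p, g in zip(predictions, golds)
    )
    return hits / len(golds)
```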
Retrieval of structured information: status quo: retrieve sentence evidence and table evidence separately. BUT, sentences around table might have valuable contextual info <— incorporate how??
Numerical Reasoning: current models show low performance on numerical reasoning, even simple arithmetic operations; 12% of FEVEROUS data require numerical reasoning
Verification of complex claims: FEVEROUS >> FEVER in len(claim) and # evidence required —> more evidence per claim, and the evidence pieces are related to each other
FEVEROUS offers an opportunity to look into how different parts of a claim are supported by evidence
Beyond Linearization: other ways to analyze the table given cell-level annotations? ex) joint training of cell selection & verdict prediction using a Graph Neural Network
source: https://arxiv.org/pdf/2104.08727.pdf
GooAQ: QA dataset retrieved from Google
Questions: semi-automatically collected from the Google search engine using the autocomplete feature
Answers: Google’s responses to the respective questions
Result after training a T5 model on GooAQ: 1) short-answer questions rely on the labeled data 2) long answers rely on pre-trained knowledge
Disclaimer: Are we mimicking Google’s QA pipeline?
No! Google’s answer box service = AI-based QA system(s) + implicit user feedback (info contained in clicks / web link structure) + explicit user feedback + expert curation of answers to common questions.
Paper’s goal: capture Google’s QA in a ‘standard’ NLP QA system
A lot of QA datasets contain rather simply-structured questions like (Who ~ / When ~ / How many ~) (aka short answers)
Everyday questions can be more complicated, with diverse answer types such as snippet (short paragraph) answers and collection (list) answers
GooAQ has 3M questions covering the wide range of question types mentioned above. Automatically mined from Google’s search autocomplete —> represents popular queries
Generative pre-trained language models tested in the paper are “self-contained reasoners”, i.e. no explicit access to external info
Short Answer Q vs (Snippet Q & Collection Q) in terms of:
Best score: although surprisingly high, still behind human gold responses
Setup: open QA. Input: question —> output: answer. Context for the answer is not given; the model must use background knowledge
Split into 3 subtasks (short, collection, snippet) —> since the model cannot infer the question type yet
Data Split Problem) knowledge leakage from training data
Solution) collect the most dissimilar sentences for eval / test (similarity measured by max token-overlap with the train data)
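A sketch of that leakage-aware split: keep an eval/test question only if its maximum token overlap with any training question is low. The overlap measure, threshold, and whitespace tokenization are assumptions, not the paper's exact procedure:

```python
def token_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))

def max_overlap_with_train(question: str, train_questions) -> float:
    return max((token_overlap(question, q) for q in train_questions), default=0.0)

def pick_dissimilar(candidates, train_questions, threshold=0.5):
    """Keep the candidates least similar to the training data."""
    return [q for q in candidates
            if max_overlap_with_train(q, train_questions) < threshold]
```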
Model T5-small and T5-11B
Evaluation: automatic —> ROUGE-L; human —> Amazon Mechanical Turk workers
Metric: fraction of cases where humans preferred the model-generated response over the gold answer; 50% = human parity
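A compact, LCS-based ROUGE-L F1 sketch for reference; real evaluations would typically use an established implementation rather than this toy version:

```python
def _lcs_len(a, b):
    # Standard dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = _lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```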
source: https://arxiv.org/abs/2107.02153
Information-seeking Questions may contain ambiguity
Disambiguation: Since the original Q is unclear, we create multiple potential (q, a) pairs
Crossover of Disambiguation: given (q1, a1) and (q2, a2) —> (q1, a2) and (q2, a1) would be invalid
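A sketch of the crossover idea: a question paired with its own answer yields a supported claim, while pairing it with the answer to a different disambiguation of the same ambiguous question yields an invalid (refuted) claim. The claim template and labels are an assumed simplification of FaVIQ's QA-to-claim conversion:

```python
def qa_to_claim(question: str, answer: str) -> str:
    # Toy template; the actual conversion rewrites the question as a statement.
    return f"The answer to '{question}' is {answer}."

def crossover(qa1, qa2):
    (q1, a1), (q2, a2) = qa1, qa2
    return [
        (qa_to_claim(q1, a1), "SUPPORTED"),
        (qa_to_claim(q2, a2), "SUPPORTED"),
        (qa_to_claim(q1, a2), "REFUTED"),   # answers swapped across disambiguations
        (qa_to_claim(q2, a1), "REFUTED"),
    ]
```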
Lower lexical bias than crowdsourced claims (a model with no background knowledge performs similarly to random guessing)
FAVIQ: Ambiguous Dataset (A) + Regular Dataset (R)
Local Mutual Information: measure of correlation between bigram and label (measure lexical bias)
FAVIQ has notably lower LMI than FEVER (less biased)
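A sketch of the LMI computation over (bigram, label) pairs, following the standard definition LMI(w, l) = p(w, l) · log(p(l | w) / p(l)); input format is assumed:

```python
import math
from collections import Counter

def lmi(examples):
    """examples: iterable of (ngram, label) pairs; returns {(ngram, label): LMI}."""
    joint, ngram_counts, label_counts = Counter(), Counter(), Counter()
    for ngram, label in examples:
        joint[(ngram, label)] += 1
        ngram_counts[ngram] += 1
        label_counts[label] += 1
    n = sum(joint.values())
    scores = {}
    for (ngram, label), c in joint.items():
        p_joint = c / n
        p_label_given_ngram = c / ngram_counts[ngram]
        p_label = label_counts[label] / n
        scores[(ngram, label)] = p_joint * math.log(p_label_given_ngram / p_label)
    return scores
```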
SNOPES: 6,422 claims annotated by professional fact-checkers, gold evidence given
SCIFACT: 1,109 claims based on scientific papers, annotated by domain experts, gold evidence not given (use TF-IDF at the sentence level)
Model: BART
NLU benchmarks become quickly obsolete (especially after the arrival of BERT)
Many SOTA models exploit spurious statistical patterns in datasets
Proposal: adversarial human-and-model-in-the-loop —> NLU dataset
If target label == Model(premise, hypothesis) —> add to train
Elif Verification(premise, hypothesis, target label) —> add to train, dev, test
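A sketch of that routing rule for the human-and-model-in-the-loop collection; `model` and `verify` are placeholders for the real classifier and the human verification step:

```python
def route_example(premise, hypothesis, target_label, model, verify):
    # Model not fooled: the example still goes to the training set.
    if model(premise, hypothesis) == target_label:
        return ["train"]
    # Model fooled: keep for train/dev/test only if human verifiers confirm the label.
    if verify(premise, hypothesis, target_label):
        return ["train", "dev", "test"]
    return []  # discarded
```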
FEVER: a large-scale dataset for Fact Extraction and Verification
source: https://arxiv.org/abs/1803.05355
Abstract
Each claim —classify—> (Supported, Refuted, Not enough info)
For Supported and Refuted, evidence is also provided
Introduction
For Textual Entailment / NLI, the paragraph containing the evidence is explicitly provided. On the other hand, Textual Claim Verification searches through a vast corpus to identify the evidence.
For Question Answering, the question contains information needed to identify the answer, but information missing from a claim can often be crucial in retrieving refuting evidence.
Pipeline:
Claim Generation
Processed the 2017/06 Wikipedia dump with Stanford CoreNLP
Sampled sentences from the introductory sections of 50,000 popular pages
Randomly choose a sentence —> make annotators generate a set of claims about the page’s subject
Tradeoff: simple generation —> paraphrases, too easy vs. using world knowledge —> too hard
Solution: a “dictionary” providing the list of terms hyperlinked in the original sentence & the 1st sentence of their respective Wikipedia entries
Mutations were also generated: paraphrasing, negation, substitution of entity with a similar/dissimilar one, make claim more general/specific
Annotators tended to produce only trivial negation mutations (just adding “not”); the authors tried to discourage this behavior
Claim Labeling
(Supported, Refuted, Not Enough Info (the claim was too general/specific to be supported or refuted))
Questions for annotators:
Annotators were allowed to add other urls to support claims. Annotators were encouraged to keep time under 2 min per claim
Data Validation
Validation of claim-generation: implicitly done in claim-labeling process. 3 ways to validate the claim-labeling process:
Baseline System Description
Pipeline: document retrieval —> sentence selection —> recognizing textual entailment
NotEnoughInfo has no corresponding evidence —> nothing to feed the RTE input
Solution: sample a sentence from the nearest page / sample a sentence at random
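A sketch of the three-stage baseline described above, with the NEI sentence-sampling workaround. `doc_retriever`, `sent_selector`, and `rte_model` are placeholders (e.g. a DrQA-style TF-IDF retriever and an entailment classifier), not the paper's exact components:

```python
import random

def verify(claim, doc_retriever, sent_selector, rte_model, k_docs=5, k_sents=5):
    docs = doc_retriever(claim, k=k_docs)            # document retrieval (TF-IDF)
    evidence = sent_selector(claim, docs, k=k_sents) # sentence selection (TF-IDF)
    label = rte_model(claim, evidence)               # Supported / Refuted / NEI
    return label, evidence

def nei_training_evidence(nearest_page_sentences):
    # NEI claims have no gold evidence, so sample a sentence from the nearest
    # retrieved page (or at random) to complete the RTE training input.
    return random.choice(nearest_page_sentences)
```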
Best performance: 31.87% for the verification & evidence detection task; 50.91% for verification only
The most challenging part is evidence identification