
sheffieldnlp / naacl2018-fever

Fact Extraction and VERification baseline published in NAACL2018

http://fever.ai

Apache License 2.0


Error Analysis #31

Closed (j6mes closed 6 years ago)

j6mes commented 6 years ago

[x] How often did document retrieval (DR) return the right page?
[x] How often did sentence retrieval (SR) return the right page?
[x] How often did SR return the original evidence?
[ ] When SR returned different evidence, how do the BLEU/ROUGE similarities between the claim and the returned evidence compare with those between the claim and the gold evidence?
[ ] Error coding scheme
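
For the open BLEU/ROUGE item, a minimal sketch of the comparison might look like the following. It uses a plain unigram-overlap F1 (ROUGE-1 style, no stemming or stopword handling) rather than a full BLEU or ROUGE implementation, and the function names `rouge1_f1` and `similarity_gap` are hypothetical, not part of the repository.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenisation; a real run would likely use a proper tokenizer."""
    return text.lower().split()


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between two strings (ROUGE-1 style, no stemming)."""
    ref = Counter(tokenize(reference))
    cand = Counter(tokenize(candidate))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def similarity_gap(claim, gold_evidence, returned_evidence):
    """Positive when the returned sentence is lexically closer to the claim than the gold one."""
    return rouge1_f1(claim, returned_evidence) - rouge1_f1(claim, gold_evidence)
```

Averaging `similarity_gap` over the cases where SR returned different evidence would indicate whether the selector tends to prefer sentences that are lexically closer to the claim than the annotated evidence.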

j6mes commented 6 years ago

| Metric | NLTK | DrQA sentences (precomputed IDF) | DrQA sentences (new IDF) |
|---|---|---|---|
| Runtime | 2 hours | 10 hours | 12 hours |
| Strict accuracy (correct evidence required) | 0.2476 | 0.1827 | 0.2698 |
| Classification accuracy (evidence not required) | 0.4885 | 0.4588 | 0.4922 |
| Correct document return rate (dmatch) | 0.5793 | 0.5893 | 0.5893 |
| Correct document return rate after sentence selection (smatch) | 0.4773 | 0.2690 | 0.5596 |
| Correct text return rate (Refutes/Supports only) | 0.3647 | 0.1083 | 0.4680 |
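
As a rough illustration of how rates like dmatch, smatch and the text return rate could be computed, here is a sketch under stated assumptions. The instance layout (`gold`, `retrieved`, `selected`) is hypothetical and not the repository's data format, claims without gold evidence are skipped, and a single matching gold page or sentence is counted as a hit. The actual evaluation script may define these differently.

```python
def match_rates(instances):
    """Return (dmatch, smatch, text_match) over instances that have gold evidence.

    Each instance is a hypothetical dict with:
      gold:      set of (page, line) pairs annotated as evidence
      retrieved: list of page titles returned by document retrieval
      selected:  list of (page, line) pairs kept by sentence selection
    """
    counted = dmatch = smatch = text_match = 0
    for inst in instances:
        gold = inst["gold"]
        if not gold:  # assumption: claims without evidence are excluded
            continue
        counted += 1
        gold_pages = {page for page, _ in gold}
        # Document retrieval returned at least one gold page.
        dmatch += bool(gold_pages & set(inst["retrieved"]))
        # At least one selected sentence comes from a gold page.
        smatch += bool(gold_pages & {page for page, _ in inst["selected"]})
        # At least one selected sentence is itself a gold sentence.
        text_match += bool(gold & set(inst["selected"]))
    if counted == 0:
        return 0.0, 0.0, 0.0
    return dmatch / counted, smatch / counted, text_match / counted
```

Under these definitions the text rate can never exceed smatch, and smatch can never exceed dmatch as long as sentences are only selected from retrieved pages, which matches the ordering of the figures in each column of the table.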

j6mes commented 6 years ago

@andreasvlachos using DrQA (with newly computed IDF) instead of NLTK for sentence selection gives us a boost of about 2 points in strict accuracy (0.2476 to 0.2698), at the cost of an extra 10 hours of runtime. The dmatch and smatch figures give us upper bounds for strict accuracy on the supported/refuted classes. With DrQA, the correct document is in the evidence after sentence selection 56% of the time, whereas with NLTK it is only 48%.
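
The trade-off can be checked directly against the table. The snippet below is only a convenience sketch: the figures are copied from the results above, while the dictionary layout and the `delta` helper are illustrative names, not code from the repository.

```python
# Figures copied from the results table in this thread.
results = {
    "NLTK": {"hours": 2, "strict": 0.2476, "smatch": 0.4773},
    "DrQA (new IDF)": {"hours": 12, "strict": 0.2698, "smatch": 0.5596},
}


def delta(metric, new="DrQA (new IDF)", old="NLTK"):
    """Absolute change in a metric when switching sentence selectors."""
    return results[new][metric] - results[old][metric]


for metric in ("strict", "smatch", "hours"):
    print(f"{metric}: {delta(metric):+.4f}")
```

The strict accuracy gain works out to roughly 2.2 points and the smatch gain to roughly 8.2 points, for 10 additional hours of runtime.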