sheffieldnlp / naacl2018-fever

Fact Extraction and VERification baseline published in NAACL2018
http://fever.ai
Apache License 2.0

Error Analysis #31

Closed by j6mes 6 years ago

j6mes commented 6 years ago
| Metric | NLTK | DRQA Sents Precomputed IDF | DRQA Sents New IDF |
| --- | --- | --- | --- |
| Runtime | 2 hours | 10 hours | 12 hours |
| Strict Accuracy (strict) requirement for correct evidence | 0.2476 | 0.1827 | 0.2698 |
| Classification Accuracy Without Need For Evidence | 0.4885 | 0.4588 | 0.4922 |
| Correct Document Return Rate (dmatch) | 0.5793 | 0.5893 | 0.5893 |
| Correct Document Return Rate after sentence selection (smatch) | 0.4773 | 0.2690 | 0.5596 |
| Correct Text Return Rate (for Refutes/Supports) | 0.3647 | 0.1083 | 0.4680 |

j6mes commented 6 years ago

@andreasvlachos using DrQA (with the new IDF) instead of NLTK for sentence selection gives us a boost of about 2 points in strict accuracy (0.2476 to 0.2698), at the cost of an extra 10 hours of runtime (2 hours to 12 hours). The dmatch and smatch figures give us upper bounds for strict accuracy (considering the supported/refuted classes). With DrQA, the correct document is in the evidence after sentence selection about 56% of the time (smatch 0.5596), whereas with NLTK this is only about 48% (smatch 0.4773).
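
To illustrate why dmatch and smatch act as upper bounds on strict accuracy for the supported/refuted classes, here is a minimal sketch of how such rates could be computed. It is not the code used for the numbers above. The record fields (`gold_evidence`, `retrieved_pages`, `selected_sentences`, `predicted_label`) and the function names are hypothetical, and the metric definitions are assumptions based on the row descriptions in the table.

```python
from typing import Dict, Iterable, List, Set, Tuple

Sentence = Tuple[str, int]  # (page title, sentence index); hypothetical layout


def covers_pages(gold_groups: Iterable[Iterable[Sentence]], pages: Set[str]) -> bool:
    """True if every page of at least one gold evidence group is in `pages`."""
    return any({page for page, _ in group} <= pages for group in gold_groups)


def evidence_metrics(claims: List[Dict]) -> Dict[str, float]:
    """Compute dmatch, smatch, text match and strict accuracy.

    Assumption: all four rates are taken over SUPPORTS/REFUTES claims only,
    since NOT ENOUGH INFO claims carry no gold evidence to match.
    """
    verifiable = [c for c in claims if c["label"] != "NOT ENOUGH INFO"]
    if not verifiable:
        raise ValueError("no SUPPORTS/REFUTES claims to evaluate")

    dmatch = smatch = text = strict = 0
    for c in verifiable:
        gold = c["gold_evidence"]
        selected = set(c["selected_sentences"])
        selected_pages = {page for page, _ in selected}

        # Document retrieval found the pages of some gold evidence group.
        dmatch += covers_pages(gold, set(c["retrieved_pages"]))
        # Those pages are still represented after sentence selection.
        smatch += covers_pages(gold, selected_pages)
        # The exact gold sentences of some group were selected.
        has_text = any(set(group) <= selected for group in gold)
        text += has_text
        # Strict: correct label and a complete gold evidence group.
        strict += has_text and c["predicted_label"] == c["label"]

    n = len(verifiable)
    return {"dmatch": dmatch / n, "smatch": smatch / n,
            "text": text / n, "strict": strict / n}


# Toy data (invented for illustration, not taken from FEVER).
toy_claims = [
    {"label": "SUPPORTS", "predicted_label": "SUPPORTS",
     "gold_evidence": [[("A", 0)]], "retrieved_pages": {"A", "B"},
     "selected_sentences": [("A", 0), ("B", 1)]},
    {"label": "REFUTES", "predicted_label": "SUPPORTS",
     "gold_evidence": [[("C", 2)]], "retrieved_pages": {"C"},
     "selected_sentences": [("C", 2)]},
    {"label": "SUPPORTS", "predicted_label": "SUPPORTS",
     "gold_evidence": [[("D", 1)]], "retrieved_pages": {"D"},
     "selected_sentences": [("D", 0)]},
    {"label": "REFUTES", "predicted_label": "REFUTES",
     "gold_evidence": [[("E", 0)]], "retrieved_pages": {"E", "F"},
     "selected_sentences": [("F", 3)]},
    {"label": "NOT ENOUGH INFO", "predicted_label": "NOT ENOUGH INFO",
     "gold_evidence": [], "retrieved_pages": {"G"},
     "selected_sentences": [("G", 0)]},
]

metrics = evidence_metrics(toy_claims)
```

As long as sentences are only selected from retrieved pages, each stage can only lose evidence, so under these definitions strict ≤ text ≤ smatch ≤ dmatch on the supported/refuted claims. That ordering matches the DrQA (new IDF) and NLTK columns in the table above.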