Fact checker for simple claims about statistical properties
This repository contains the code and data needed to reproduce the results of the paper:
Identification and Verification of Simple Claims about Statistical Properties
Andreas Vlachos and Sebastian Riedel, EMNLP 2015
Preprocessing:
We run the following Java programs from HTML2Stanford: HTML2Text (needs the BoilerPipe jar) and Text2Parsed2JSON (be careful to use the CollapsedCCprocessed dependencies; ideally use a recent version of Stanford CoreNLP (>3.5) that outputs JSON directly).
From this we obtain a large number of HTML pages, converted to text and parsed with Stanford CoreNLP.
And then:
buildMatrix.py: processes the preprocessed HTML pages and builds a JSON file containing a dictionary that maps each pattern (string or lexicalized dependencies) to countries/locations, and each country/location to its values.
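As a rough illustration of the structure described above, the following sketch groups (pattern, location, value) triples into a nested dictionary and serializes it to JSON. The function name build_matrix, the pattern strings, and the numbers are made up for illustration; the exact layout written by buildMatrix.py may differ.

```python
import json
from collections import defaultdict


def build_matrix(observations):
    """Group (pattern, location, value) triples as pattern -> location -> [values]."""
    matrix = defaultdict(lambda: defaultdict(list))
    for pattern, location, value in observations:
        matrix[pattern][location].append(value)
    # Convert the nested defaultdicts to plain dicts before serializing.
    return {pattern: dict(locations) for pattern, locations in matrix.items()}


# Hypothetical observations, not taken from the repository's data.
observations = [
    ("LOCATION has a population of NUMBER", "France", 66.0),
    ("LOCATION has a population of NUMBER", "France", 65.5),
    ("LOCATION has a population of NUMBER", "Spain", 46.8),
    ("NUMBER people live in LOCATION", "Spain", 47.0),
]
matrix = build_matrix(observations)
matrix_json = json.dumps(matrix, indent=2, sort_keys=True)
print(matrix_json)
```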
matrixFiltering.py: takes the matrix from the previous step and filters its values and patterns, removing those without enough entries and those whose entries deviate too much to be sensibly averaged. It also uses the aliases to merge the values for the different location names used in the experiments. From this we get the file data/theMatrixExtend120TokenFiltered_2_2_0.1_0.5_fixed2.json.
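The kind of filtering described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names, the threshold parameters, the use of relative standard deviation as the deviation measure, and the alias format (alias name mapped to canonical name) are all hypothetical and are not taken from matrixFiltering.py or from the numbers in the output file name.

```python
from statistics import mean, pstdev


def merge_aliases(locations, aliases):
    """Merge value lists of alias names into their canonical location name.

    `aliases` is assumed to map alias -> canonical name (hypothetical format).
    """
    merged = {}
    for name, values in locations.items():
        canonical = aliases.get(name, name)
        merged.setdefault(canonical, []).extend(values)
    return merged


def filter_matrix(matrix, min_values, min_locations, max_rel_dev):
    """Keep averaged values only where entries are numerous and consistent."""
    filtered = {}
    for pattern, locations in matrix.items():
        kept = {}
        for location, values in locations.items():
            if len(values) < min_values:
                continue  # too few entries to average
            avg = mean(values)
            if avg == 0 or pstdev(values) / abs(avg) > max_rel_dev:
                continue  # entries disagree too much to be averaged
            kept[location] = avg
        if len(kept) >= min_locations:
            filtered[pattern] = kept
    return filtered


# Hypothetical input, for illustration only.
raw = {"USA": [1.0], "United States": [2.0]}
print(merge_aliases(raw, {"USA": "United States"}))

example = {"p": {"France": [10.0, 10.5], "Spain": [1.0, 100.0], "Italy": [5.0]}}
print(filter_matrix(example, min_values=2, min_locations=1, max_rel_dev=0.1))
```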
Split the data from Freebase (data/allCountriesPost2010-2014Filtered15-150.json) into training/dev (data/train.json) and test (data/test.json).
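One possible way to produce such a split is sketched below. It assumes, without confirmation from the repository, that the Freebase file is a JSON dictionary keyed by country and that the split is made over those keys; the function name, the test fraction, and the seed are illustrative choices, not the ones used for the paper.

```python
import random


def split_by_key(data, test_fraction=0.25, seed=0):
    """Split a dictionary into train and test dictionaries by its top-level keys."""
    keys = sorted(data)                # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(keys)  # seeded shuffle, independent of global state
    n_test = int(len(keys) * test_fraction)
    test_keys = set(keys[:n_test])
    train = {k: v for k, v in data.items() if k not in test_keys}
    test = {k: v for k, v in data.items() if k in test_keys}
    return train, test


# Hypothetical data: eight entries with a single property each.
data = {"country_%d" % i: {"population": float(i)} for i in range(8)}
train, test = split_by_key(data)
print(len(train), len(test))
```

In practice the two dictionaries would then be written out with json.dump to data/train.json and data/test.json.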
To reproduce the IE-style evaluation results:
python src/main/fixedValuePredictor.py data/train.json data/theMatrixExtend120TokenFiltered_2_2_0.1_0.5_fixed2.json data/test.json out/informedGuess
python src/main/baselinePredictor.py data/train.json data/theMatrixExtend120TokenFiltered_2_2_0.1_0.5_fixed2.json data/test.json out/unadjustedMAPE FALSE
python src/main/baselinePredictor.py data/train.json data/theMatrixExtend120TokenFiltered_2_2_0.1_0.5_fixed2.json data/test.json out/adjustedMAPE TRUE
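The output names above refer to MAPE (mean absolute percentage error). For orientation, here is a minimal sketch of plain MAPE over dictionaries of predicted and actual values. The function name and the example numbers are hypothetical, and the adjusted variant used in the paper is not reproduced here.

```python
def mape(predicted, actual):
    """Mean absolute percentage error over the keys present in both dicts."""
    # Skip entries with a zero actual value, where the percentage is undefined.
    shared = [k for k in actual if k in predicted and actual[k] != 0]
    if not shared:
        raise ValueError("no comparable entries")
    total = sum(abs(predicted[k] - actual[k]) / abs(actual[k]) for k in shared)
    return total / len(shared)


# Hypothetical values: both predictions are off by 10 percent.
score = mape({"France": 110.0, "Spain": 90.0}, {"France": 100.0, "Spain": 100.0})
print(score)
```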
To run the fact-checker on the HTML pages obtained from the web:
First create a directory for the output, e.g.:
mkdir out
Then run:
python src/main/factChecker.py data/allCountriesPost2010-2014Filtered15-150.json data/theMatrixExtend120TokenFiltered_2_2_0.1_0.5_fixed2.json population 0.03125 data/htmlPages2textPARSEDALL data/locationNames data/aliases.json out/population.tsv
The directory data/htmlPages2textPARSEDALL is not on GitHub due to its size (1.6GB compressed), but feel free to ask me for it.
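To avoid retyping the long command above for other properties, a small helper can assemble the argument list. This is a sketch only: the helper name build_command is hypothetical, the out/<property>.tsv naming simply follows the population example, and the only property/parameter pair shown is the one given above.

```python
FREEBASE = "data/allCountriesPost2010-2014Filtered15-150.json"
MATRIX = "data/theMatrixExtend120TokenFiltered_2_2_0.1_0.5_fixed2.json"


def build_command(prop, parameter):
    """Return the factChecker.py argument list for one property."""
    return [
        "python", "src/main/factChecker.py",
        FREEBASE, MATRIX,
        prop, str(parameter),
        "data/htmlPages2textPARSEDALL",
        "data/locationNames",
        "data/aliases.json",
        "out/%s.tsv" % prop,  # output name follows the population example
    ]


cmd = build_command("population", 0.03125)
print(" ".join(cmd))
# The list can be passed to subprocess.run(cmd, check=True) to execute it.
```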
This is run for each of the 16 properties independently. The parameter for adjusted MAPE used in the paper was set according to the IE experiments. Here is the table with the setting for each property:
The output for each relation is a .tsv file which can be loaded in Excel. We did this and labeled the claims. The files from which the results in Table 2 are obtained are in data/labeled_claims.
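The .tsv output can also be inspected without Excel. The sketch below reads tab-separated rows with Python's csv module; the function name read_tsv and the three sample columns are placeholders, since the actual column layout of the output files is not described here.

```python
import csv
import io


def read_tsv(handle):
    """Read all rows of a tab-separated file object into lists of strings."""
    return [row for row in csv.reader(handle, delimiter="\t")]


# Hypothetical two-row sample standing in for a file such as out/population.tsv.
sample = io.StringIO("France\t66000000\tfirst sentence\nSpain\t46800000\tsecond sentence\n")
rows = read_tsv(sample)
for row in rows:
    print(row)

# For a real file:
# with open("out/population.tsv", newline="") as handle:
#     rows = read_tsv(handle)
```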