seominjoon / piqa

Phrase-Indexed Question Answering (PIQA)
https://pi-qa.com
Apache License 2.0

Performance difference between evaluate.py and piqa_evaluate.py #3

Open jhyuklee opened 6 years ago

jhyuklee commented 6 years ago

The scores reported by the two evaluation scripts differ as follows:

$ python evaluate.py $SQUAD_DEV_PATH /tmp/piqa/pred.json 
{"exact_match": 52.81929990539262, "f1": 63.28879733489547}
$ python piqa_evaluate.py $SQUAD_DEV_PATH /tmp/piqa/context_emb/ /tmp/piqa/question_emb/
{"exact_match": 52.28949858088931, "f1": 62.72236634535493}

The difference is about 0.5 to 0.6 points, and the tested model is LSTM+SA+ELMo.

jhyuklee commented 6 years ago
$ python evaluate.py $SQUAD_DEV_PATH /tmp/piqa/pred.json
{"exact_match": 53.207190160832546, "f1": 63.382281758599724}

$ python piqa_evaluate.py $SQUAD_DEV_PATH /tmp/piqa/context_emb/ /tmp/piqa/question_emb/
{"exact_match": 53.39640491958373, "f1": 63.51748187339812}

Sometimes the piqa_evaluate.py score comes out higher instead.

jhyuklee commented 6 years ago

Have found that prediction json files are quite different, too. More than 600 preds out of 10570 preds are different between (original) pred.json and piqa version pred.json. One possible cause is the different test behaviors (outer product of start, end probs vs. inner product of phrase and query vecs). Outer product of start, end probs (which was desgined for efficient learning and testing) could result in a different ranking compared to the inner product.