jhyuklee opened this issue 6 years ago

Performances of the two evaluation scripts differ as follows:

```
$ python evaluate.py $SQUAD_DEV_PATH /tmp/piqa/pred.json
{"exact_match": 53.207190160832546, "f1": 63.382281758599724}
$ python piqa_evaluate.py $SQUAD_DEV_PATH /tmp/piqa/context_emb/ /tmp/piqa/question_emb/
{"exact_match": 53.39640491958373, "f1": 63.51748187339812}
```

The difference is about 0.5~0.6, and the tested model is LSTM+SA+ELMo. Sometimes it goes up.
I have found that the prediction JSON files are quite different, too: more than 600 of the 10,570 predictions differ between the (original) pred.json and the PIQA-version pred.json. One possible cause is the different test-time behavior (outer product of start/end probabilities vs. inner product of phrase and question vectors). The outer product of start and end probabilities (which was designed for efficient learning and testing) could produce a different ranking than the inner product.
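To make the hypothesized mismatch concrete, here is a toy sketch of the two decoding rules. Everything in it (shapes, the `[start; end]` phrase representation, the random inputs) is an assumption for illustration, not the repo's actual code:

```python
# Toy comparison of the two decoding rules; all shapes, the [start; end]
# phrase representation, and the random inputs are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 4        # context length, half of the assumed phrase-vector dimension
max_len = 3        # max answer-span length used to enumerate candidate phrases

candidates = [(i, j) for i in range(T) for j in range(i, min(i + max_len, T))]

# Path 1 (evaluate.py-style): score each candidate span (i, j) with the outer
# product of the start and end probability distributions.
start_logits = rng.normal(size=T)
end_logits = rng.normal(size=T)
p_start = np.exp(start_logits) / np.exp(start_logits).sum()
p_end = np.exp(end_logits) / np.exp(end_logits).sum()
outer_scores = {(i, j): p_start[i] * p_end[j] for i, j in candidates}
best_outer = max(outer_scores, key=outer_scores.get)

# Path 2 (piqa_evaluate.py-style): each candidate phrase has a dense vector
# (here assumed to be concatenated start/end states) and the winner is the
# phrase with the largest inner product with the question vector.
H_start = rng.normal(size=(T, d))
H_end = rng.normal(size=(T, d))
q = rng.normal(size=2 * d)
inner_scores = {(i, j): float(q @ np.concatenate([H_start[i], H_end[j]]))
                for i, j in candidates}
best_inner = max(inner_scores, key=inner_scores.get)

# The two rules optimize different quantities, so nothing forces them to
# pick the same (start, end) pair.
print("outer-product argmax:", best_outer)
print("inner-product argmax:", best_inner)
```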
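For reference, a count like the 600-out-of-10,570 above can be obtained by diffing the two prediction files directly. A minimal sketch, assuming both files are standard SQuAD-style `{question_id: answer_text}` JSON (the PIQA-side output path below is hypothetical):

```python
# Count how many answers differ between the two prediction files; the
# PIQA-side path is hypothetical, and both files are assumed to be
# standard SQuAD-style {question_id: answer_text} JSON.
import json

with open("/tmp/piqa/pred.json") as f:
    orig_preds = json.load(f)
with open("/tmp/piqa/piqa_pred.json") as f:  # hypothetical path for the PIQA-version predictions
    piqa_preds = json.load(f)

shared = orig_preds.keys() & piqa_preds.keys()
num_diff = sum(1 for qid in shared if orig_preds[qid] != piqa_preds[qid])
print(f"{num_diff} / {len(shared)} predictions differ")
```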