sheffieldnlp / naacl2018-fever

Fact Extraction and VERification baseline published in NAACL2018
http://fever.ai
Apache License 2.0

Rationalise Scripts and Run Final Experiments #15

Closed: j6mes closed this 6 years ago

j6mes commented 6 years ago

To run:

  • MLP: Train on FNC, evaluate on FNC, evaluate on FEVER 3-way
  • MLP: Train on FEVER with sampled negative pages, test
  • MLP: Train on FEVER with IR negative pages, test
  • DR: Final score for recall/precision/MRR (see the metrics sketch below)
  • RTE: Pre-trained model, evaluate on FEVER
  • RTE: Train on FEVER bodies, evaluate on FEVER

Extra:

  • BiDAF: Precision/recall of the pre-trained model
  • BiDAF: FEVER accuracy using the pre-trained model on DrQA pages
  • RTE: Train on BiDAF-retrieved pages; evaluate P/R of BiDAF; evaluate FEVER score

andreasvlachos commented 6 years ago

Just checking: we are not planning on learning DR, right? That's fine, but it would be good to ensure that the DR component is good enough for the entailment part. That is, given an oracle RTE component, what accuracy do we get with the DR we have? It should be better than a random baseline, right? A related question: is there some kind of threshold to restrict the documents we get from DR, or do we take only the top one? (The top one is probably a good start, assuming it gives us decent accuracy with an oracle RTE.)


j6mes commented 6 years ago

The DR has no parameters, so there's nothing to learn. I'm taking the top 5 articles at the moment, and will also try taking all articles above a score threshold (sketch below).

The only metric I've computed so far is recall, but testing with an oracle RTE is a good idea and easy for me to do too.