sheffieldnlp / naacl2018-fever

Fact Extraction and VERification baseline published in NAACL2018
http://fever.ai
Apache License 2.0

Rationalise Scripts and Run Final Experiments #15

Closed: j6mes closed this 6 years ago

j6mes commented 6 years ago

To run:

  • MLP: Train on FNC, evaluate on FNC, evaluate on FEVER 3-way
  • MLP: Train on FEVER with sampled negative pages, test
  • MLP: Train on FEVER with IR negative pages, test
  • DR: Final score for recall/precision/MRR (see the metrics sketch below)
  • RTE: Pre-trained model, evaluate on FEVER
  • RTE: Train on FEVER bodies, evaluate on FEVER

Extra:

  • BiDAF: Precision/recall of the pre-trained model
  • BiDAF: FEVER accuracy using the pre-trained model on DrQA pages
  • RTE: Train on BiDAF-retrieved pages; evaluate P/R of BiDAF; evaluate FEVER score

andreasvlachos commented 6 years ago

Just checking: we are not planning on learning DR, right? That's fine, but it would be good to ensure that the DR component is good enough for the entailment part. That is, given an oracle RTE component, what accuracy do we get with the DR we have? It should be better than a random baseline, right? A related question: is there some kind of threshold to restrict the documents we get from DR, or do we take only the top one? (The top one is probably a good start, assuming it gives us decent accuracy with an oracle RTE.)


j6mes commented 6 years ago

The DR has no parameters, so there's nothing to learn. I'm taking the top 5 articles at the moment, and will also try taking all articles above a score threshold (sketch below).

The only metric I've computed so far is recall, but testing with an oracle RTE is a good idea and easy for me to do too.