KristiyanVachev opened this issue 3 years ago
Hi! I've started working on my own QG algorithm for my Master's thesis, and I'm trying to learn how to evaluate a model.
Since you've posted your metrics, I've been trying to replicate them, but I'm having some trouble. Can you tell me what your approach was?
I'm using nlg-eval as well, and from your references.txt I gather that you generate one hypothesis for each question in the SQuAD dev set.
I ran your algorithm (t5-base) on the first 1000 references.
The results I got from nlg-eval were:
Bleu_1: 0.295304
Bleu_2: 0.202987
Bleu_3: 0.144872
Bleu_4: 0.107891
METEOR: 0.178506
ROUGE_L: 0.290285
CIDEr: 0.955470
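For context, this is roughly how I computed the scores above with nlg-eval's Python API (a sketch; `hypotheses.txt` is my own file name, aligned line-by-line with your references.txt):

```python
from nlgeval import compute_metrics

# hypotheses.txt: one generated question per line (my file, hypothetical name)
# references.txt: the gold SQuAD questions, one per line, in the same order
metrics = compute_metrics(
    hypothesis='hypotheses.txt',
    references=['references.txt'],
)
print(metrics)  # dict with Bleu_1..Bleu_4, METEOR, ROUGE_L, CIDEr
```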
Do you pass only the sentence the SQuAD question is located in, or the entire paragraph? And which of the generated questions do you take as your hypothesis?
Looking forward to your answer!