TODO - GitHub Issues

pdasigi commented 2 years ago

High priority:

[ ] Look into MOCHA and finalize how much we can use it for evaluation.
[ ] If we decide we need additional human annotations like MOCHA, finalize the annotation scheme.
[ ] Finalize out-of-distribution (OOD) evaluation sets for each of the datasets we are training quality estimators for (MS MARCO, NarrativeQA, Qasper).

Medium priority:

[ ] Augment training data for the (question, context, prediction) -> F1 score model and see if we can do better on Qasper
[ ] LSTM with attention on final layer features -> F1 score for Qasper
[ ] Train the QCP (question, context, prediction) -> F1 model on MS MARCO and NarrativeQA
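
For reference, here is a minimal sketch of how regression targets for the (question, context, prediction) -> F1 model could be built. It assumes the F1 in question is SQuAD-style token-overlap F1 with plain lowercased whitespace tokenization; the actual tokenization and normalization used in the repo may differ, and `token_f1` and `build_examples` are hypothetical names.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: lowercased whitespace tokens, no punctuation or article stripping.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def build_examples(records):
    """Turn (question, context, prediction, reference) tuples into
    ((question, context, prediction), f1_target) training pairs."""
    return [((q, c, p), token_f1(p, r)) for q, c, p, r in records]
```

The quality estimator would then be trained to predict the second element of each pair from the first, without seeing the reference.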

Low priority:

[ ] Include LM score as an additional feature in the MLP quality estimator

After we have an evaluation setup:

[ ] Compare different target metrics for quality estimation (e.g., is regressing against ROUGE better than regressing against F1?). The evaluation metric will be rank correlation (possibly binned) against human annotations.
[ ] Compare multi-task regression against several metrics with calibrating on individual metrics, again evaluated by rank correlation against human annotations.
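
A rough sketch of the rank-correlation evaluation, assuming Spearman correlation is what we mean by "rank correlation" and that "binned" means bucketing estimator scores in [0, 1] into equal-width bins before ranking. Both are assumptions, and all function names here are hypothetical. It is written without dependencies so it can be dropped in anywhere.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bin_scores(scores, num_bins=5):
    """Map scores in [0, 1] to integer bins 0..num_bins-1 (equal width)."""
    return [min(int(s * num_bins), num_bins - 1) for s in scores]
```

Usage would look like `spearman(estimator_scores, human_ratings)` for the unbinned variant and `spearman(bin_scores(estimator_scores), human_ratings)` for the binned one, computed once per target metric so the variants can be compared.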

PastelBelem8 commented 2 years ago

For the evaluation setup, the following papers can be insightful for the creation of an Answer Equivalence task:

Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation (Bulian et al., 2022): proposes an Answer Equivalence (AE) task and a corresponding SQuAD-based, human-annotated dataset. The authors advocate for asymmetric metrics, and their annotation procedure reflects categories such as partially correct/incorrect, missing information, and extra information.
MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics (Chen et al., 2020): repurposes different multiple-choice (MC) and extractive settings to be generative and collects human annotations for equivalent answers on a 5-point Likert scale.

PastelBelem8 commented 2 years ago

Another idea for QE could be to calibrate the final model score by fitting a model that uses the values of the different metrics to compute the model confidence. Since each metric can have different strengths or weaknesses, combining the different metrics (e.g., via a linear combination) might give a more reliable estimate than any single one.
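
To make the linear-combination idea concrete, here is a minimal sketch. It assumes the weights are fit by ordinary least squares against some target confidence in [0, 1] (for example, rescaled human ratings); the fitting target is not decided above, so treat that as an assumption. `fit_linear_combination` and `combined_confidence` are hypothetical names.

```python
def fit_linear_combination(metric_rows, targets):
    """Least-squares weights (plus bias, stored last) for combining metrics.

    metric_rows: one tuple of metric values per example, e.g. (f1, rouge).
    targets: the confidence each example should map to (assumed in [0, 1]).
    """
    X = [list(row) + [1.0] for row in metric_rows]  # append bias column
    d = len(X[0])
    # Normal equations: (X^T X) w = X^T y
    A = [[sum(r[i] * r[j] for r in X) for j in range(d)] for i in range(d)]
    b = [sum(r[i] * t for r, t in zip(X, targets)) for i in range(d)]
    # Gaussian elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("metrics are linearly dependent; cannot fit weights")
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, d):
            factor = A[r][col] / A[col][col]
            for c in range(col, d):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    w = [0.0] * d
    for i in reversed(range(d)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, d))) / A[i][i]
    return w


def combined_confidence(weights, metric_row):
    """Weighted sum of metric values plus bias, clipped to [0, 1]."""
    score = sum(w * m for w, m in zip(weights, metric_row)) + weights[-1]
    return min(1.0, max(0.0, score))
```

A linear model is only the simplest option here; the same interface would work with a logistic or isotonic calibrator if the linear fit turns out to be too crude.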

pdasigi / eqqa

TODO #1