Open pdasigi opened 2 years ago
For the evaluation setup, the following papers may be useful when creating an Answer Equivalence task:
Another idea for QE could be to calibrate the final model score by fitting a model that uses the values of the different metrics to compute the model confidence... Since each metric has its own strengths and weaknesses, combining them (e.g., via a linear combination) could give a more reliable score than any single metric.
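A rough sketch of what the linear-combination idea might look like, assuming we have a handful of human equivalence judgments (binary labels) and a few per-example metric scores. The three metric columns (exact match, token F1, an embedding-based score), the toy numbers, and the helper names `fit_combiner` and `confidence` are all hypothetical and only for illustration. It fits a plain logistic regression over the metric values with batch gradient descent, so the output can be read as a confidence in [0, 1]:

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_combiner(metric_scores, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights over per-metric scores (batch gradient descent)."""
    n = len(metric_scores)
    n_metrics = len(metric_scores[0])
    weights = [0.0] * n_metrics
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_metrics
        grad_b = 0.0
        for x, y in zip(metric_scores, labels):
            # Prediction error for this example under the current weights.
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def confidence(weights, bias, x):
    # Linear combination of the metric values, squashed to [0, 1].
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical toy data: [exact_match, token_f1, embedding_score] per example,
# with a made-up human label (1 = answers judged equivalent, 0 = not).
train_scores = [
    [1.0, 1.00, 0.95],
    [0.0, 0.80, 0.90],
    [0.0, 0.67, 0.85],
    [0.0, 0.20, 0.40],
    [0.0, 0.00, 0.30],
    [0.0, 0.10, 0.55],
]
train_labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_combiner(train_scores, train_labels)
high = confidence(weights, bias, [0.0, 0.75, 0.88])  # resembles the positive examples
low = confidence(weights, bias, [0.0, 0.05, 0.35])   # resembles the negative examples
print(f"weights={weights}, bias={bias:.3f}, high={high:.3f}, low={low:.3f}")
```

The learned weights also give a rough view of which metric the combiner leans on most. With real data one would want a held-out split to check that the confidences are actually calibrated, not just well separated.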
High priority:
Medium priority:
Low priority:
After we have an evaluation setup: