[Research Spike] Explore metrics to analyze different question answering model outputs - Githubissues

redhat-et / foundation-models-for-documentation

Improve ROSA customer experience (and customer retention) by leveraging foundation models to do “gpt-chat” style search of Red Hat customer documentation assets.

[Research Spike] Explore metrics to analyze different question answering model outputs #2

Closed Shreyanand closed 1 year ago

Shreyanand commented 1 year ago

To compare different QA models for user documentation, we need to decide on quantitative metrics for evaluating the models. As part of this issue, add resources and discussion points for the metrics to be used. Add a notebook that shows some of these metrics.

Shreyanand commented 1 year ago

BARTScore: https://arxiv.org/pdf/2106.11520.pdf

ROUGE: https://aclanthology.org/W04-1013.pdf

BLEU, BERTScore, METEOR, WMD

Blog to read: https://blog.paperspace.com/automated-metrics-for-evaluating-generated-text/

Metrics surveys: https://arxiv.org/pdf/2006.14799.pdf , https://aclanthology.org/2021.triton-1.6.pdf

Medium implementation blog: https://medium.com/@vincentchen0110/evaluating-your-text-generation-results-simple-as-that-e74547383181

Human performance ratings based on a ChatGPT-style architecture (good read but may not be applicable): https://arxiv.org/pdf/2203.02155.pdf?fbclid=IwAR2nZdBpdZZzvxpwI6H_bRmP4RwGOyzke9Ud63lWBe1YlyI_1BRAFhnUMUg

Haystack and Deepset implementation of a QA framework: https://www.deepset.ai/blog/how-to-evaluate-question-answering They use EM, F1, and SAS metrics (F1 here is computed on the bag-of-words overlap between the predicted and ground-truth answers, so it is more forgiving than exact match).
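To make the difference between EM and the bag-of-words F1 concrete, here is a minimal sketch in plain Python. It assumes SQuAD-style normalization (lowercasing, dropping punctuation and English articles), which is a common convention but not necessarily what Haystack does internally. The function names `normalize`, `exact_match`, and `token_f1` are hypothetical and chosen only for illustration.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization; other frameworks may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction, truth):
    """Return 1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def token_f1(prediction, truth):
    """Harmonic mean of token precision and recall over the shared tokens."""
    pred_tokens = normalize(prediction)
    true_tokens = normalize(truth)
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not true_tokens:
        return float(pred_tokens == true_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    truth = "the ROSA CLI"
    prediction = "Use the ROSA CLI tool"
    print(exact_match(prediction, truth))  # strict: any extra token fails
    print(token_f1(prediction, truth))     # partial credit for shared tokens
```

With this example pair, EM gives no credit because of the extra words, while F1 still rewards the overlapping tokens. SAS is not sketched here because it depends on a trained semantic similarity model.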

Shreyanand commented 1 year ago

@suppathak Is there a consensus in the latest literature on this that we can follow? Are there more resources on this topic we should go through?

suppathak commented 1 year ago

Here are some more resources that I would like to include:

KPQA: https://github.com/hwanheelee1993/KPQA

Links related to BERTScore: https://huggingface.co/spaces/evaluate-metric/bertscore https://github.com/Tiiiger/bert_score

Shreyanand commented 1 year ago

Langchain Evaluation Chains

codificat commented 1 year ago

I believe that this is mostly complete with #16.

@suppathak is there anything missing? For example, I don't see KPQA mentioned in the notebook from #16, but BERTScore is there.

Shreyanand commented 1 year ago

@codificat #16 and the commits in #28 complete this issue 🎉