List all potential test benchmarks - Githubissues

gomate-community / rageval

Evaluation tools for Retrieval-augmented Generation (RAG) methods.

Apache License 2.0

136 stars, 11 forks

List all potential test benchmarks #63

Open faneshion opened 8 months ago

faneshion commented 8 months ago

List the most commonly used datasets in RAG research, and we will add them to the benchmarks.

[ ] THUDM/webglm-qa from huggingface: https://huggingface.co/datasets/THUDM/webglm-qa
[ ] NaturalQuestions from huggingface: https://huggingface.co/datasets/natural_questions
[ ] #64
[ ] Trivia QA from huggingface: https://huggingface.co/datasets/trivia_qa
[ ] Hotpot QA from huggingface: https://huggingface.co/datasets/hotpot_qa
[ ] WikiEval from huggingface: https://huggingface.co/datasets/explodinggradients/WikiEval

FBzzh commented 8 months ago

[ ] MMLU from huggingface: https://huggingface.co/datasets/cais/mmlu
[ ] PopQA from huggingface: https://huggingface.co/datasets/akariasai/PopQA
[ ] WebQuestions from huggingface: https://huggingface.co/datasets/web_questions
[ ] FEVER from huggingface: https://huggingface.co/datasets/fever
[ ] FeTaQA from huggingface: https://huggingface.co/datasets/DongfuTingle/FeTaQA

FBzzh commented 8 months ago

[ ] MedMCQA from huggingface: https://huggingface.co/datasets/medmcqa
[ ] GSM8K from huggingface: https://huggingface.co/datasets/gsm8k
[ ] BBH from github: https://github.com/suzgunmirac/BIG-Bench-Hard
[ ] SQuAD from huggingface: https://huggingface.co/datasets/squad
[ ] SQuAD_v2 from huggingface: https://huggingface.co/datasets/squad_v2
[ ] Wizard-of-Wikipedia (WoW) from huggingface: https://huggingface.co/datasets/chujiezheng/wizard_of_wikipedia
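
For anyone picking up one of these items, here is a rough sketch of how the Hugging Face entries listed so far could be kept in one registry and loaded by name. The names `BENCHMARK_URLS`, `hub_id`, and `load_benchmark` are hypothetical and not part of rageval. The URLs are copied from the checklists in this thread, so the ids derived from them may have moved on the Hub since. Several datasets (for example trivia_qa, hotpot_qa, and cais/mmlu) also need a configuration name, which is left to the caller. BBH is hosted on GitHub and is left out.

```python
from urllib.parse import urlparse

# Dataset URLs copied from the checklists in this thread (BBH is on GitHub, so omitted).
BENCHMARK_URLS = {
    "webglm-qa": "https://huggingface.co/datasets/THUDM/webglm-qa",
    "natural_questions": "https://huggingface.co/datasets/natural_questions",
    "trivia_qa": "https://huggingface.co/datasets/trivia_qa",
    "hotpot_qa": "https://huggingface.co/datasets/hotpot_qa",
    "wikieval": "https://huggingface.co/datasets/explodinggradients/WikiEval",
    "mmlu": "https://huggingface.co/datasets/cais/mmlu",
    "popqa": "https://huggingface.co/datasets/akariasai/PopQA",
    "web_questions": "https://huggingface.co/datasets/web_questions",
    "fever": "https://huggingface.co/datasets/fever",
    "fetaqa": "https://huggingface.co/datasets/DongfuTingle/FeTaQA",
    "medmcqa": "https://huggingface.co/datasets/medmcqa",
    "gsm8k": "https://huggingface.co/datasets/gsm8k",
    "squad": "https://huggingface.co/datasets/squad",
    "squad_v2": "https://huggingface.co/datasets/squad_v2",
    "wow": "https://huggingface.co/datasets/chujiezheng/wizard_of_wikipedia",
}


def hub_id(url: str) -> str:
    """Return the Hub dataset id, i.e. the URL path after '/datasets/'."""
    path = urlparse(url).path
    prefix = "/datasets/"
    if not path.startswith(prefix):
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return path[len(prefix):].strip("/")


def load_benchmark(name, config=None, split=None):
    """Load one listed benchmark (hypothetical helper).

    Needs `pip install datasets` and network access. `config` is the
    dataset configuration name, which some of the datasets require.
    """
    # Imported lazily so the registry itself works without the library.
    from datasets import load_dataset

    return load_dataset(hub_id(BENCHMARK_URLS[name]), config, split=split)
```

Something like `load_benchmark("squad", split="validation")` would then fetch a single split, assuming the id still resolves on the Hub.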

Wenshansilvia commented 8 months ago

Select and implement typical benchmarks, collect the RAG papers that use them, and try to reproduce the evaluation results reported in those papers.

List each benchmark with its related papers and metrics.
Produce a test set using the baseline RAG method from the paper, pack it in dataset format, and upload it to Hugging Face.
Reproduce the evaluation results in RAGEval.
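
As a possible starting point for the packing step, the sketch below serializes test set records to JSON Lines and pushes the file to the Hub. The field names in `REQUIRED_FIELDS` and the helpers `to_jsonl` and `upload_jsonl` are assumptions for illustration, not an agreed rageval schema or API. The upload part needs the `datasets` library and a logged-in Hugging Face account.

```python
import json

# Hypothetical record schema; adjust to whatever the benchmark actually needs.
REQUIRED_FIELDS = ("question", "contexts", "answer", "gt_answers")


def to_jsonl(records):
    """Serialize records to JSON Lines, one JSON object per line."""
    lines = []
    for i, rec in enumerate(records):
        missing = [k for k in REQUIRED_FIELDS if k not in rec]
        if missing:
            raise ValueError(f"record {i} is missing fields: {missing}")
        lines.append(json.dumps(rec, ensure_ascii=False))
    return ("\n".join(lines) + "\n") if lines else ""


def upload_jsonl(path, repo_id):
    """Load a JSON Lines file as a dataset and push it to the Hub.

    Needs `pip install datasets`, network access, and Hub credentials.
    `repo_id` is a placeholder such as "<user-or-org>/<dataset-name>".
    """
    # Imported lazily so to_jsonl works without the library installed.
    from datasets import load_dataset

    ds = load_dataset("json", data_files=path, split="train")
    ds.push_to_hub(repo_id)
```

Writing the output of `to_jsonl` to a file and passing that path to `upload_jsonl` covers the "pack and upload" step. Generating the answers with the baseline RAG and scoring them in RAGEval are left out here.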

ELI5 @QianHaosheng, ASQA @bugtig6351, FEVER @henan991201