In this PR, the ELI5 benchmark is implemented, based mainly on the ALCE repo.
Both the OpenAI API and local LLMs are now supported.
Added a cache directory, `.rageval`. Within it, `datasets` stores the original datasets, `models` stores local LLMs, NLI models, and other models, and `results` stores the outputs generated by the RAG system.
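The cache layout described above would look roughly like this (directory names taken from the description; treat the exact nesting as an assumption):

```
.rageval/
├── datasets/   # original datasets
├── models/     # local LLMs, NLI models, and other models
└── results/    # outputs generated by the RAG system
```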
During the evaluation phase, users can either evaluate immediately after generating results or evaluate previously saved results files.
Some code in the NLI part has been slightly modified: since most NLI models are generation models rather than classification models, the `text2text-generation` task is now supported. NLI models can also now run on GPUs.
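A minimal sketch of what "generation-style NLI on GPU" can look like with the `transformers` pipeline API. The model name and prompt format below follow the TRUE-style seq2seq NLI setup that ALCE uses; treat both as assumptions if your checkpoint expects a different input format.

```python
def build_nli_prompt(premise: str, hypothesis: str) -> str:
    # TRUE-style input; the model generates "1" for entailed, "0" otherwise.
    return f"premise: {premise} hypothesis: {hypothesis}"

def load_nli_pipeline(model_name: str = "google/t5_xxl_true_nli_mixture"):
    # Lazy imports so the prompt helpers above stay usable without GPU deps.
    import torch
    from transformers import pipeline
    device = 0 if torch.cuda.is_available() else -1  # 0 = first GPU, -1 = CPU
    return pipeline("text2text-generation", model=model_name, device=device)

def entails(nli, premise: str, hypothesis: str) -> bool:
    # The generated token, not a classification logit, carries the label.
    out = nli(build_nli_prompt(premise, hypothesis), max_new_tokens=3)
    return out[0]["generated_text"].strip() == "1"
```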
We have now obtained results for the VANILLA method with llama2-7b-chat on the ELI5 dataset. The claim recall and citation recall metrics are consistent with the numbers in the paper, but the citation precision metric looks abnormally high; I will investigate and try to fix it. Results for gpt-3.5-turbo will be uploaded later.
In addition, testing showed that the evaluation process is very slow, which hurts the user experience. Supporting parallel computation for these metrics will be necessary in future work.
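One hedged sketch of that future parallelism, using a thread pool over per-example metric calls. `score_example` is a hypothetical placeholder for any per-sample metric; threads mainly help when the metric is I/O-bound (e.g. OpenAI API calls), while local GPU models would more likely want batched inference instead.

```python
from concurrent.futures import ThreadPoolExecutor

def score_all(examples, score_example, max_workers=8):
    # Preserves input order: pool.map yields results in submission order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score_example, examples))
```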