gomate-community / rageval

Evaluation tools for Retrieval-augmented Generation (RAG) methods.
Apache License 2.0

Add ELI5 test benchmark #62

faneshion closed this issue 3 months ago

faneshion commented 4 months ago

This issue is to add the ELI5 benchmark to cover all evaluation dimensions.

@QianHaosheng @Wenshansilvia We can discuss it in detail.

The pipeline of the test may be as follows:

>>> eli5_dataset = load_dataset('ELI5', 'test')

# load the model which needs to be evaluated
>>> model = load_model('mistral')
>>> eli5_dataset = model.predict(eli5_dataset)

# 1) test answer rouge correctness
>>> func = gt_answer_claims_extraction()
>>> eli5_dataset = eli5_dataset.map(lambda example: func(example))
>>> import rageval
>>> task = rageval.tasks._generator(metrics = ['_answer_rouge_correctness'])
>>> result = task.evaluate(eli5_dataset)

# 2) test context recall
>>> task.set_metric(['_context_f1_recall'])
>>> result = task.evaluate(eli5_dataset)
...
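For concreteness, the claims-extraction step could start as a plain map function over the dataset. Below is a minimal sketch, assuming each example carries a `gt_answer` field and the extracted claims land in a new `gt_answer_claims` field; the field names and the naive sentence splitting are assumptions for illustration, not the final rageval API:

```python
import re

def gt_answer_claims_extraction(example):
    """Split the ground-truth answer into sentence-level claims so that
    claim-based metrics can consume them (naive sentence splitting here;
    a real implementation may use an LLM or NLI model instead)."""
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", example["gt_answer"]) if s.strip()]
    example["gt_answer_claims"] = claims
    return example

# usage with datasets.Dataset.map:
# eli5_dataset = eli5_dataset.map(gt_answer_claims_extraction)
```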
faneshion commented 4 months ago

@QianHaosheng As ELI5 is part of KILT, we should move the ELI5 benchmark into the KILT benchmark, where the directory structure looks like:

- rageval
  - rageval
  - benchmarks
    - KILT
      - FEVER
      - ELI5
      - ...
    - ASQA
    - BBQ
    - ...
  - tests
  - ...

It is worth noting that the "eli5" dataset is defunct and no longer accessible on Hugging Face. We can still download the train and validation sets from this repo: https://github.com/facebookresearch/KILT?tab=readme-ov-file.
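For reference, once the KILT ELI5 files have been downloaded locally, they can be loaded with the generic JSON loader from Hugging Face `datasets`. A minimal sketch, assuming the released file names `eli5-train-kilt.jsonl` and `eli5-dev-kilt.jsonl` and a local `data/` directory (adjust paths to wherever the files were saved):

```python
from datasets import load_dataset

# local paths to the downloaded KILT ELI5 files
# (file names are assumptions based on the KILT release)
data_files = {
    "train": "data/eli5-train-kilt.jsonl",
    "validation": "data/eli5-dev-kilt.jsonl",
}
eli5_dataset = load_dataset("json", data_files=data_files)
print(eli5_dataset["validation"][0])
```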