gomate-community / rageval

Evaluation tools for Retrieval-augmented Generation (RAG) methods.
Apache License 2.0

Add ELI5 test benchmark #62

faneshion closed this issue 3 months ago

faneshion commented 4 months ago

This issue is to add the ELI5 benchmark to cover all evaluation dimensions.

@QianHaosheng @Wenshansilvia We can discuss it in detail.

The pipeline of the test may be as follows:

>>> eli5_dataset = load_dataset('ELI5', 'test')

# load the model which needs to be evaluated
>>> model = load_model('mistral')
>>> eli5_dataset = model.predict(eli5_dataset)

# 1) test answer rouge correctness
>>> func = gt_answer_claims_extraction()
>>> eli5_dataset = eli5_dataset.map(lambda example: func(example))
>>> import rageval
>>> task = rageval.tasks._generator(metrics = ['_answer_rouge_correctness'])
>>> result = task.evaluate(eli5_dataset)

# 2) test context recall
>>> task.set_metric(['_context_f1_recall'])
>>> result = task.evaluate(eli5_dataset)
...
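For concreteness, the claims-extraction step could start as a plain map function over the dataset. Below is a minimal sketch, assuming each example carries a `gt_answer` field and the extracted claims land in a new `gt_answer_claims` field; the field names and the naive sentence splitting are assumptions for illustration, not the final rageval API:

```python
import re

def gt_answer_claims_extraction(example):
    """Split the ground-truth answer into sentence-level claims so that
    claim-based metrics can consume them (naive sentence splitting here;
    a real implementation may use an LLM or NLI model instead)."""
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", example["gt_answer"]) if s.strip()]
    example["gt_answer_claims"] = claims
    return example

# usage with datasets.Dataset.map:
# eli5_dataset = eli5_dataset.map(gt_answer_claims_extraction)
```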
faneshion commented 4 months ago

@QianHaosheng As ELI5 is part of KILT, we should move the ELI5 benchmark into the KILT benchmark, where the directory structure looks like:

- rageval
  - rageval
  - benchmarks
    - KILT
      - FEVER
      - ELI5
      - ...
    - ASQA
    - BBQ
    - ...
  - tests
  - ...

It is worth noting that the "eli5" dataset is defunct and no longer accessible on Hugging Face. We can still download the train and validation sets from this repo: https://github.com/facebookresearch/KILT?tab=readme-ov-file.
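For reference, once the KILT ELI5 files have been downloaded locally, they can be loaded with the generic JSON loader from Hugging Face `datasets`. A minimal sketch, assuming the released file names `eli5-train-kilt.jsonl` and `eli5-dev-kilt.jsonl` and a local `data/` directory (adjust paths to wherever the files were saved):

```python
from datasets import load_dataset

# local paths to the downloaded KILT ELI5 files
# (file names are assumptions based on the KILT release)
data_files = {
    "train": "data/eli5-train-kilt.jsonl",
    "validation": "data/eli5-dev-kilt.jsonl",
}
eli5_dataset = load_dataset("json", data_files=data_files)
print(eli5_dataset["validation"][0])
```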