## Description

Adds a new pack: `RagEvaluatorPack`

Given:

- A `rag_dataset` (i.e., a `LabelledRagDataset`)
- A `query_engine` (i.e., a `BaseQueryEngine`) built off the same source `Document`s as the `rag_dataset`
- Optionally, an LLM to be used as the judge (defaults to OpenAI gpt-4)

Returns benchmark results for:

- Context similarity
- Correctness
- Faithfulness
- Relevancy

(Same metrics shown in the Dataset Card.)
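For reviewers, a minimal usage sketch is below. It assumes the pack follows the standard `download_llama_pack` flow; all file paths are placeholders, and the end-to-end notebook added in this PR is the canonical example.

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.llama_dataset import LabelledRagDataset
from llama_index.llama_pack import download_llama_pack

# Build a query engine over the SAME source documents the rag_dataset
# was generated from (directory path is a placeholder).
documents = SimpleDirectoryReader("./data/source_files").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# Load the labelled RAG dataset from disk (path is a placeholder).
rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")

# Download the pack's source into a local directory and instantiate it.
RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./rag_evaluator_pack")
rag_evaluator = RagEvaluatorPack(
    query_engine=query_engine,
    rag_dataset=rag_dataset,
    # An LLM judge can optionally be passed in here (per the description
    # above, it defaults to OpenAI gpt-4); the exact keyword name is not
    # shown in this PR body, so it is omitted from this sketch.
)

# Generates predictions with the query engine, judges them, and returns
# benchmark scores for correctness, relevancy, faithfulness, and context
# similarity.
benchmark_df = rag_evaluator.run()
print(benchmark_df)
```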
Fixes # (issue)
## Type of Change

Please delete options that are not relevant.

- [x] New Pack
## How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

- [x] Added new notebook (that tests end-to-end) (in main framework)
- [x] I stared at the code and made sure it makes sense