Why
We need better, automated evals to measure performance shifts of Khoj across prompt, model, and capability changes.
Google's FRAMES benchmark evaluates multi-step retrieval and reasoning capabilities of AI agents. It's a good starter benchmark to evaluate Khoj.
Details
This PR adds an eval script that runs Khoj on the FRAMES benchmark prompts and scores its responses against the benchmark's ground truth answers.
Gemini is used as an LLM judge to auto-grade Khoj responses against the ground truth from the benchmark. A minimal sketch of this flow is shown below.
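For reference, a minimal sketch of the query-then-judge loop, not the actual script: it assumes the FRAMES dataset published on Hugging Face as `google/frames-benchmark`, a locally running Khoj server with a `/api/chat` endpoint, and the `google-generativeai` client. The dataset column names, the response JSON field, and the judge prompt wording are illustrative assumptions.

```python
"""Sketch of the FRAMES eval flow: ask Khoj, then grade with Gemini as LLM judge."""
import os

import requests
import google.generativeai as genai
from datasets import load_dataset

KHOJ_URL = os.getenv("KHOJ_URL", "http://localhost:42110")  # assumed local Khoj server

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
judge = genai.GenerativeModel("gemini-1.5-flash")  # judge model is an example choice


def ask_khoj(prompt: str) -> str:
    # Send the benchmark prompt to Khoj's chat API; endpoint and field names are assumptions.
    response = requests.get(f"{KHOJ_URL}/api/chat", params={"q": prompt}, timeout=120)
    response.raise_for_status()
    return response.json()["response"]


def grade(prompt: str, answer: str, ground_truth: str) -> bool:
    # Ask Gemini to compare Khoj's answer against the benchmark's ground truth answer.
    verdict = judge.generate_content(
        "You are grading an AI assistant's answer against a reference answer.\n"
        f"Question: {prompt}\nReference answer: {ground_truth}\nAssistant answer: {answer}\n"
        "Reply with exactly TRUE if the answer matches the reference, else FALSE."
    )
    return verdict.text.strip().upper().startswith("TRUE")


frames = load_dataset("google/frames-benchmark", split="test")
scores = []
for row in frames.select(range(10)):  # small sample as a smoke test
    answer = ask_khoj(row["Prompt"])
    scores.append(grade(row["Prompt"], answer, row["Answer"]))

print(f"Accuracy on sample: {sum(scores) / len(scores):.2%}")
```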