khoj-ai / khoj

Your AI second brain. Self-hostable. Get answers from the web or your docs. Build custom agents, schedule automations, do deep research. Turn any online or local LLM into your personal, autonomous AI (e.g. gpt, claude, gemini, llama, qwen, mistral).
https://khoj.dev
GNU Affero General Public License v3.0
14.7k stars, 729 forks

Add Script to Evaluate Khoj on Google's FRAMES benchmark #955

Closed debanjum closed 6 days ago

debanjum commented 1 week ago

Why

We need better, automated evals to measure performance shifts of Khoj across prompt, model and capability changes.

Google's FRAMES benchmark evaluates multi-step retrieval and reasoning capabilities of AI agents. It's a good starter benchmark to evaluate Khoj.

Details

This PR adds an eval script that runs Khoj on the FRAMES benchmark prompts and compares its responses against the ground truth answers the benchmark provides.

Gemini is used as an LLM judge to automatically grade each Khoj response against the benchmark's ground truth answer.
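The grading loop described above can be sketched roughly as follows. This is a minimal illustration and not the script added in this PR: `Example`, `build_judge_prompt`, `parse_verdict`, `evaluate`, and the TRUE/FALSE verdict format are all hypothetical names and conventions chosen for the sketch. The agent and the judge are passed in as plain callables, so the real Khoj and Gemini calls (not shown here) could be swapped in without changing the loop.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Example:
    """One benchmark item: a prompt and its ground truth answer (hypothetical shape)."""
    prompt: str
    answer: str


def build_judge_prompt(question: str, ground_truth: str, response: str) -> str:
    # Hypothetical judge prompt; the wording used by the actual script may differ.
    return (
        "Decide whether the response agrees with the ground truth. Reply TRUE or FALSE.\n"
        f"Question: {question}\n"
        f"Ground truth: {ground_truth}\n"
        f"Response: {response}"
    )


def parse_verdict(judge_output: str) -> bool:
    # Assumes the judge starts its reply with TRUE or FALSE.
    return judge_output.strip().upper().startswith("TRUE")


def evaluate(
    examples: Iterable[Example],
    ask_agent: Callable[[str], str],
    ask_judge: Callable[[str], str],
) -> float:
    """Return the fraction of agent responses the judge marks as correct."""
    correct = 0
    total = 0
    for example in examples:
        response = ask_agent(example.prompt)
        verdict = ask_judge(build_judge_prompt(example.prompt, example.answer, response))
        correct += int(parse_verdict(verdict))
        total += 1
    return correct / total if total else 0.0
```

Keeping `ask_agent` and `ask_judge` as injected callables also makes the loop easy to dry-run with stub functions before spending API calls on the full benchmark.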