This project allows you to create and run LLM judges based on annotated datasets using Weights & Biases (wandb) and Weave for tracking and tracing.
## Setup

Create a `.env` file in the project root with the following variables:

```
WANDB_EMAIL=your_wandb_email
WANDB_API_KEY=your_wandb_api_key
OPENAI_API_KEY=your_openai_api_key
```
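To sanity-check your credentials before running anything, a minimal sketch like the one below loads the `.env` file and initializes wandb and Weave. It assumes the `python-dotenv` package is installed, and `llm-judge` is a placeholder project name, not something this repo defines:

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed
import wandb
import weave

# Load WANDB_EMAIL, WANDB_API_KEY, and OPENAI_API_KEY from .env
load_dotenv()

# Authenticate with wandb using the key from the environment
wandb.login(key=os.environ["WANDB_API_KEY"])

# Start a Weave project so subsequent calls are tracked and traced
# ("llm-judge" is a placeholder project name)
weave.init("llm-judge")
```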
## Annotation App

To start the annotation app, run:

```bash
python main.py
```

This will launch a web interface for annotating your dataset.
## Creating a Judge

To programmatically create an LLM judge from your wandb dataset annotations, open and run `forge_evaluation_judge.ipynb` in a Jupyter environment. This will generate a judge like the one in `forged_judge`.
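The exact judge the notebook produces depends on your annotations, but conceptually it is an LLM scorer distilled from them. Here is a minimal sketch of what such a judge can look like as a Weave model; the class name `ForgedJudge`, the prompt, and the `gpt-4o` model choice are illustrative assumptions, not what the notebook necessarily emits:

```python
import json

import weave
from openai import OpenAI


class ForgedJudge(weave.Model):  # hypothetical name, for illustration
    model_name: str = "gpt-4o"
    system_prompt: str = (
        "You are a judge. Score the response from 1 to 5 for correctness "
        'and explain your reasoning. Reply as JSON: {"score": int, "reason": str}'
    )

    @weave.op()
    def predict(self, question: str, answer: str) -> dict:
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        response = client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
            ],
            response_format={"type": "json_object"},
        )
        return json.loads(response.choices[0].message.content)
```

Because `predict` is a `weave.op`, every judge call is captured as a trace in the Weave UI.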
## Running the Judge

To load and run the generated judge, open and run `run_forged_judge.ipynb` in a Jupyter environment. This will evaluate your dataset using the forged judge, with results fully tracked and traced using Weave.
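In outline, running a judge over a dataset with Weave looks like the sketch below. It reuses the hypothetical `ForgedJudge` class sketched above; the dataset rows and the `score_is_positive` scorer are placeholders, and the notebook's actual loading code may differ:

```python
import asyncio

import weave

weave.init("llm-judge")  # placeholder project name

# A tiny in-memory dataset; in practice this would come from your
# wandb/Weave annotations.
dataset = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Berlin"},
]


@weave.op()
def score_is_positive(output: dict) -> dict:
    # Example scorer: treat judge scores of 4 or 5 as a pass.
    return {"pass": output["score"] >= 4}


judge = ForgedJudge()  # the judge sketched above
evaluation = weave.Evaluation(dataset=dataset, scorers=[score_is_positive])

# Each model call and scorer call is traced in the Weave UI.
# Inside a notebook you can instead write: await evaluation.evaluate(judge)
asyncio.run(evaluation.evaluate(judge))
```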
## Project Structure

- `main.py`: Annotation app
- `forge_evaluation_judge.ipynb`: Judge creation notebook
- `run_forged_judge.ipynb`: Judge execution notebook

All components are integrated with Weave for comprehensive tracking and tracing of your machine learning workflow.
Happy evaluating!