Closed simonw closed 2 months ago
Evals need to support parameters. There are lots of interesting evals out there like MMLU which are defined as CSV files - being able to reference CSVs from within the YAML would be a neat way of avoiding having to duplicate the actual test cases directly in the YAML.
In as many cases as possible file paths and URLs should both be supported - so you can fire off an eval run against one that's hosted online with a single command.
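A helper for that could be quite small. This is a minimal sketch, assuming a hypothetical `load_eval_source()` function (not part of any existing API) that treats anything starting with `http://` or `https://` as a URL and everything else as a local path:

```python
import urllib.request


def load_eval_source(ref: str) -> str:
    """Hypothetical helper: fetch eval/test-case content from either a
    local file path or an http(s) URL, so one CLI argument covers both."""
    if ref.startswith(("http://", "https://")):
        with urllib.request.urlopen(ref) as response:
            return response.read().decode("utf-8")
    with open(ref, encoding="utf-8") as f:
        return f.read()
```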
Possible format:
```yaml
evals_version: 0.1
evals:
- name: increment
  prompt: |
    What comes next after 5885?
  expect:
  - contains: 5886
- name: decrement
  prompt: |
    What comes before 4284?
  expect:
  - contains: 4283
```
But there are so many ways that we might want to evaluate a response:
For testing https://github.com/datasette/datasette-query-assistant, a plugin that can execute the returned SQL query and assert against the result would be ideal.
Tricky edge-case: how to evaluate the thing where datasette-query-assistant verifies returned SQL with an explain and round-trips error messages through the model up to three times. Might need another plugin mechanism for that.
I think each eval should occupy its own file. I don't think named evals in a list are actually that useful.
I'm going to want a mechanism for checking if evals should be run again. I'm tempted to use SHA256 of the contents of the eval file - that way I can store that hash somewhere, use it as a unique ID and identify if a new set of evals should be executed.
This will be good for evals that can run in CI.
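The hashing idea needs nothing beyond the standard library. A minimal sketch, with `eval_file_id` as a hypothetical name:

```python
import hashlib
from pathlib import Path


def eval_file_id(path: str) -> str:
    """Hypothetical: SHA-256 of the raw eval file bytes, usable both as a
    stable unique ID and as a marker for deciding whether to re-run."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()
```

Storing that hexdigest alongside the results means a CI run can skip any eval file whose hash it has seen before.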
Once flattened an eval would look like this:
```yaml
evals_version: 0.1
name: increment
prompt: |
  What comes next after 5885?
expect:
- contains: 5886
```
Alternatives for `expect`:

- `assert`
- `verify`
- `checks`
- `must`

I quite like `checks`, which could look something like this:
```yaml
evals_version: 0.1
name: pelicans
prompt: |
  Three names for a pet pelican, numbered
checks:
- contains: 1.
- contains: 2.
- contains: 3.
- not-contains: 4.
```
If each of these checks is a `key: details` pair, then the key could correspond to a check registered by a plugin. So for my SQL things it could look like this - assuming a plugin that adds a `sqlite_execute` assertion and a `sqlite_setup` setup mechanism:
```yaml
ev: 0.1  # ev is short for evals_version
name: SQLite SQL
system: |
  You return SQL select statements for SQLite
  Database schema: create table articles (id integer primary key, title text, body text, created_yyyymmdd text)
prompt: |
  Count articles published in 2023
prefill: "select "
setup:
  sqlite_setup: |
    create table articles (id integer primary key, title text, body text, created_yyyymmdd text);
    insert into articles (title, body, created_yyyymmdd) values
      ("One", "Body one", "2023-01-01"),
      ("Two", "Body two", "2023-02-01"),
      ("Three", "Body three", "2024-01-01");
checks:
- sqlite_execute: [[2]]
- contains: count(*)
```
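Under the hood, a `sqlite_execute` check could run the setup SQL against a throwaway in-memory database, execute the model's returned SQL, and compare the fetched rows to the expected value. A minimal sketch - the function name and signature are assumptions, not the plugin's actual API:

```python
import sqlite3

SETUP_SQL = """
create table articles (id integer primary key, title text, body text, created_yyyymmdd text);
insert into articles (title, body, created_yyyymmdd) values
  ('One', 'Body one', '2023-01-01'),
  ('Two', 'Body two', '2023-02-01'),
  ('Three', 'Body three', '2024-01-01');
"""


def sqlite_execute_check(setup_sql: str, model_sql: str, expected_rows) -> bool:
    """Hypothetical sqlite_execute check: run the setup in a fresh
    in-memory database, execute the model's SQL, compare result rows."""
    db = sqlite3.connect(":memory:")
    db.executescript(setup_sql)
    actual = [list(row) for row in db.execute(model_sql)]
    return actual == expected_rows
```

With the setup above, a model response of `select count(*) from articles where created_yyyymmdd like '2023%'` returns `[[2]]` and the check passes.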
I want a parameterization mechanism, so you can run the same eval against multiple examples at once. Those examples can be stored directly in the YAML or can be referenced as the filename or URL to a CSV or JSON file.
I realize that, just like with pytest, I'd like to be able to apply multiple parameter groups at once - so I could eg define a set of 100 SQL query questions and assertions and then also provide 20 possible system prompts and run the whole matrix of 2,000 responses to see which system prompt scores highest.
I'm going to try sandboxed Jinja embedded in the YAML or JSON as the parameter mechanism.
Not sure how best to handle the multiple parameter groups mechanism. Feels like a CLI design problem.
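Whatever the CLI design ends up being, the underlying matrix is just a Cartesian product of the parameter groups. A sketch of the 20 × 100 example above, with placeholder data standing in for the CSV/JSON-sourced examples:

```python
from itertools import product

# Placeholder parameter groups: in the real tool these would be loaded
# from files referenced in the YAML.
examples = [{"prompt": f"question {i}"} for i in range(100)]
system_prompts = [f"system prompt variant {j}" for j in range(20)]

# Every (system prompt, example) combination becomes one eval run.
runs = [
    {"system": system, **example}
    for system, example in product(system_prompts, examples)
]
assert len(runs) == 2000  # the 20 x 100 matrix described above
```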
I think each check is a pass/fail, but an eval can report a percentage-passed score across its checks in addition to an overall pass/fail showing whether all of the checks passed.
This means the database schema that stores the results needs to have a concept of those checks so it can store results for each one of them.
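The roll-up itself is trivial. A sketch, with `score` as a hypothetical name:

```python
def score(check_results: list) -> dict:
    """Hypothetical roll-up: per-eval percentage passed plus an
    overall pass/fail that is true only if every check passed."""
    return {
        "passed": sum(check_results),
        "total": len(check_results),
        "percentage": 100 * sum(check_results) / len(check_results),
        "all_passed": all(check_results),
    }
```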
Beginning to form a vocabulary (WIP):
I'm currently overthinking how to store those checks and results.
I'm going to be storing a lot of these potentially - for 20 prompts against 100 examples with 3 checks per example I'd store 6,000 check results.
I think a `check_results` table with integer foreign keys to the check, the prompt, the example and a 1 or 0 column for the result. That's 4 integers stored per check.
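That schema could look something like this - table and column names here are guesses for illustration, not a final design:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
create table prompts (id integer primary key, system text, prompt text);
create table examples (id integer primary key, data text);
create table checks (id integer primary key, eval_name text, definition text);

-- One row per check result: four integers each, as described above.
create table check_results (
    check_id integer references checks(id),
    prompt_id integer references prompts(id),
    example_id integer references examples(id),
    result integer  -- 1 = pass, 0 = fail
);
""")
db.execute("insert into check_results values (1, 1, 1, 1)")
```

At four integers per row, the 6,000-result example above stays comfortably small.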
With the initial prototype:
```yaml
ev: 0.1
name: Basic languages
system: |
  Return just a single word in the specified language
prompt: |
  Apple in Spanish
checks:
- iexact: manzana
- notcontains: apple
```
Then run:
```
llm evals simple.yml -m 4t -m chatgpt
```

(Currently requires the `OPENAI_API_KEY` environment variable to be set.)
Output:
```
('gpt-4-turbo-preview', [True, True])
('gpt-3.5-turbo', [True, True])
```
It doesn't yet save results to a database, but it's illustrating the basic `checks` mechanism - which is defined by a bunch of classes registered with the new `register_eval_checks()` plugin hook:
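For illustration, a registered check class might look something like this - a sketch covering the `iexact` and `notcontains` checks from the YAML above, with all class and method names assumed rather than taken from the plugin's actual API:

```python
class IExact:
    """Hypothetical check class: case-insensitive exact match,
    as used by the `iexact` key in the YAML above."""
    key = "iexact"

    def __init__(self, expected: str):
        self.expected = expected

    def run(self, response_text: str) -> bool:
        return response_text.strip().lower() == self.expected.lower()


class NotContains:
    """Hypothetical check class for the `notcontains` key."""
    key = "notcontains"

    def __init__(self, needle: str):
        self.needle = needle

    def run(self, response_text: str) -> bool:
        return self.needle.lower() not in response_text.lower()
```

The `key` attribute is how the YAML check name would map to the class; the runner would instantiate the class with the YAML value and call `run()` against the model's response.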
After `llm install llm-claude-3` this works:

```
llm evals simple.yml -m 4t -m chatgpt -m haiku -m opus
('gpt-4-turbo-preview', [True, True])
('gpt-3.5-turbo', [True, True])
('claude-3-haiku-20240307', [True, True])
('claude-3-opus-20240229', [True, True])
```
PyPI rejected my alpha release because the name was too similar to https://pypi.org/project/llmevals/
https://pypi.org/project/llmeval/ also exists.
Putting repo private again while I figure out a new name.
Closing this, work will continue in other issues.
I like this approach! I actually created something very similar to llm-evals-plugin for my day job, and my yaml evals look almost exactly like this.
I assume you're thinking about this, but I made `checks` easily pluggable. That lets me implement new checks very easily. One very handy check is the `llm` check, which uses another (ideally better) model to check the output of this one. So then I can make checks like:

```yaml
- llm: the second paragraph of the response mentions the importance of eating an apple every day
```
The idea here is to create a lightweight system for running evals against models using LLM.
This would run the evals described in that YAML file against those two models, outputting the results to stdout and also storing them in the default LLM SQLite database.
`-d evals.db` would store to a different specified database. Much of the challenge will be designing that YAML format (JSON will be supported too, for people who really can't stand YAML) - especially how the evals define their assertions for checking if they succeeded.