Closed simonw closed 2 months ago
Evals need to support parameters. There are lots of interesting evals out there like MMLU which are defined as CSV files - being able to reference CSVs from within the YAML would be a neat way of avoiding having to duplicate the actual test cases directly in the YAML.
In as many cases as possible file paths and URLs should both be supported - so you can fire off an eval run against one that's hosted online with a single command.
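A helper for that could be quite small. This is a minimal sketch, assuming a hypothetical `load_eval_source()` function (not part of any existing API) that treats anything starting with `http://` or `https://` as a URL and everything else as a local path:

```python
import urllib.request


def load_eval_source(ref: str) -> str:
    """Hypothetical helper: fetch eval/test-case content from either a
    local file path or an http(s) URL, so one CLI argument covers both."""
    if ref.startswith(("http://", "https://")):
        with urllib.request.urlopen(ref) as response:
            return response.read().decode("utf-8")
    with open(ref, encoding="utf-8") as f:
        return f.read()
```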
Possible format:
```yaml
evals_version: 0.1
evals:
- name: increment
  prompt: |
    What comes next after 5885?
  expect:
  - contains: 5886
- name: decrement
  prompt: |
    What comes before 4284?
  expect:
  - contains: 4283
```
But there are so many ways that we might want to evaluate a response:
For testing https://github.com/datasette/datasette-query-assistant, a plugin that can execute the returned SQL query and assert against the result would be ideal.
Tricky edge-case: how to evaluate the thing where datasette-query-assistant verifies returned SQL with an explain and round-trips error messages through the model up to three times. Might need another plugin mechanism for that.
I think each eval should occupy its own file. I don't think named evals in a list are actually that useful.
I'm going to want a mechanism for checking if evals should be run again. I'm tempted to use SHA256 of the contents of the eval file - that way I can store that hash somewhere, use it as a unique ID and identify if a new set of evals should be executed.
This will be good for evals that can run in CI.
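The hashing idea needs nothing beyond the standard library. A minimal sketch, with `eval_file_id` as a hypothetical name:

```python
import hashlib
from pathlib import Path


def eval_file_id(path: str) -> str:
    """Hypothetical: SHA-256 of the raw eval file bytes, usable both as a
    stable unique ID and as a marker for deciding whether to re-run."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()
```

Storing that hexdigest alongside the results means a CI run can skip any eval file whose hash it has seen before.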
Once flattened an eval would look like this:
```yaml
evals_version: 0.1
name: increment
prompt: |
  What comes next after 5885?
expect:
- contains: 5886
```
Alternatives for `expect`:

- `assert`
- `verify`
- `checks`
- `must`

I quite like `checks`, which could look something like this:
```yaml
evals_version: 0.1
name: pelicans
prompt: |
  Three names for a pet pelican, numbered
checks:
- contains: 1.
- contains: 2.
- contains: 3.
- not-contains: 4.
```
If each of these checks is a `key: details` pair, then the key could correspond to a check registered by a plugin. So for my SQL things it could look like this - assuming a plugin that adds a `sqlite_execute` assertion and a `sqlite_setup` setup mechanism:
```yaml
ev: 0.1  # ev is short for evals_version
name: SQLite SQL
system: |
  You return SQL select statements for SQLite
  Database schema: create table articles (id integer primary key, title text, body text, created_yyyymmdd text)
prompt: |
  Count articles published in 2023
prefill: "select "
setup:
  sqlite_setup: |
    create table articles (id integer primary key, title text, body text, created_yyyymmdd text);
    insert into articles (title, body, created_yyyymmdd) values
      ("One", "Body one", "2023-01-01"),
      ("Two", "Body two", "2023-02-01"),
      ("Three", "Body three", "2024-01-01");
checks:
- sqlite_execute: [[2]]
- contains: count(*)
```
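Under the hood, a `sqlite_execute` check could run the setup SQL against a throwaway in-memory database, execute the model's returned SQL, and compare the fetched rows to the expected value. A minimal sketch - the function name and signature are assumptions, not the plugin's actual API:

```python
import sqlite3

SETUP_SQL = """
create table articles (id integer primary key, title text, body text, created_yyyymmdd text);
insert into articles (title, body, created_yyyymmdd) values
  ('One', 'Body one', '2023-01-01'),
  ('Two', 'Body two', '2023-02-01'),
  ('Three', 'Body three', '2024-01-01');
"""


def sqlite_execute_check(setup_sql: str, model_sql: str, expected_rows) -> bool:
    """Hypothetical sqlite_execute check: run the setup in a fresh
    in-memory database, execute the model's SQL, compare result rows."""
    db = sqlite3.connect(":memory:")
    db.executescript(setup_sql)
    actual = [list(row) for row in db.execute(model_sql)]
    return actual == expected_rows
```

With the setup above, a model response of `select count(*) from articles where created_yyyymmdd like '2023%'` returns `[[2]]` and the check passes.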
I want a parameterization mechanism, so you can run the same eval against multiple examples at once. Those examples can be stored directly in the YAML or can be referenced as the filename or URL to a CSV or JSON file.
I realize that, just like with pytest, I'd like to be able to apply multiple parameter groups at once - so I could eg define a set of 100 SQL query questions and assertions and then also provide 20 possible system prompts and run the whole matrix of 2,000 responses to see which system prompt scores highest.
I'm going to try sandboxed Jinja embedded in the YAML or JSON as the parameter mechanism.
Not sure how best to handle the multiple parameter groups mechanism. Feels like a CLI design problem.
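Whatever the CLI design ends up being, the underlying matrix is just a Cartesian product of the parameter groups. A sketch of the 20 × 100 example above, with placeholder data standing in for the CSV/JSON-sourced examples:

```python
from itertools import product

# Placeholder parameter groups: in the real tool these would be loaded
# from files referenced in the YAML.
examples = [{"prompt": f"question {i}"} for i in range(100)]
system_prompts = [f"system prompt variant {j}" for j in range(20)]

# Every (system prompt, example) combination becomes one eval run.
runs = [
    {"system": system, **example}
    for system, example in product(system_prompts, examples)
]
assert len(runs) == 2000  # the 20 x 100 matrix described above
```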
I think each check is a pass/fail, but an eval can report a percentage-passed score across its checks in addition to an overall pass/fail showing whether all of the checks passed.
This means the database schema that stores the results needs to have a concept of those checks so it can store results for each one of them.
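The roll-up itself is trivial. A sketch, with `score` as a hypothetical name:

```python
def score(check_results: list) -> dict:
    """Hypothetical roll-up: per-eval percentage passed plus an
    overall pass/fail that is true only if every check passed."""
    return {
        "passed": sum(check_results),
        "total": len(check_results),
        "percentage": 100 * sum(check_results) / len(check_results),
        "all_passed": all(check_results),
    }
```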
Beginning to form a vocabulary (WIP):
I'm currently overthinking how to store those checks and results.
I'm going to be storing a lot of these potentially - for 20 prompts against 100 examples with 3 checks per example I'd store 6,000 check results.
I think a `check_results` table with integer foreign keys to the check, the prompt, the example and a 1 or 0 column for the result. That's 4 integers stored per check.
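That schema could look something like this - table and column names here are guesses for illustration, not a final design:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
create table prompts (id integer primary key, system text, prompt text);
create table examples (id integer primary key, data text);
create table checks (id integer primary key, eval_name text, definition text);

-- One row per check result: four integers each, as described above.
create table check_results (
    check_id integer references checks(id),
    prompt_id integer references prompts(id),
    example_id integer references examples(id),
    result integer  -- 1 = pass, 0 = fail
);
""")
db.execute("insert into check_results values (1, 1, 1, 1)")
```

At four integers per row, the 6,000-result example above stays comfortably small.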
With the initial prototype:
```yaml
ev: 0.1
name: Basic languages
system: |
  Return just a single word in the specified language
prompt: |
  Apple in Spanish
checks:
- iexact: manzana
- notcontains: apple
```
Then run:
```
llm evals simple.yml -m 4t -m chatgpt
```

(Currently requires the `OPENAI_API_KEY` environment variable to be set.)
Output:
```
('gpt-4-turbo-preview', [True, True])
('gpt-3.5-turbo', [True, True])
```
It doesn't yet save results to a database, but it's illustrating the basic `checks` mechanism - which is defined by a bunch of classes registered with the new `register_eval_checks()` plugin hook:
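For illustration, a registered check class might look something like this - a sketch covering the `iexact` and `notcontains` checks from the YAML above, with all class and method names assumed rather than taken from the plugin's actual API:

```python
class IExact:
    """Hypothetical check class: case-insensitive exact match,
    as used by the `iexact` key in the YAML above."""
    key = "iexact"

    def __init__(self, expected: str):
        self.expected = expected

    def run(self, response_text: str) -> bool:
        return response_text.strip().lower() == self.expected.lower()


class NotContains:
    """Hypothetical check class for the `notcontains` key."""
    key = "notcontains"

    def __init__(self, needle: str):
        self.needle = needle

    def run(self, response_text: str) -> bool:
        return self.needle.lower() not in response_text.lower()
```

The `key` attribute is how the YAML check name would map to the class; the runner would instantiate the class with the YAML value and call `run()` against the model's response.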
After `llm install llm-claude-3` this works:

```
llm evals simple.yml -m 4t -m chatgpt -m haiku -m opus
('gpt-4-turbo-preview', [True, True])
('gpt-3.5-turbo', [True, True])
('claude-3-haiku-20240307', [True, True])
('claude-3-opus-20240229', [True, True])
```
PyPI rejected my alpha release because the name was too similar to https://pypi.org/project/llmevals/
https://pypi.org/project/llmeval/ also exists.
Putting repo private again while I figure out a new name.
Closing this, work will continue in other issues.
I like this approach! I actually created something very similar to llm-evals-plugin for my day job, and my yaml evals look almost exactly like this.
I assume you're thinking about this, but I made `checks` easily pluggable. That lets me implement new checks very easily. One very handy check is the `llm` check, which uses another (ideally better) model to check the output of this one. So then I can make checks like:

```yaml
- llm: the second paragraph of the response mentions the importance of eating an apple every day
```
The idea here is to create a lightweight system for running evals against models using LLM.
This would run the evals described in that YAML file against those two models, outputting the results to stdout and also storing them in the default LLM SQLite database.
`-d evals.db` would store to a different specified database. Much of the challenge will be designing that YAML format (JSON will be supported too, for people who really can't stand YAML) - especially how the evals define their assertions for checking if they succeeded.