simonw opened 6 months ago:
Two ways this could work:
pytest
I'm leaning towards that second option at the moment, unless I can come up with a really simple design for the first option.
There's actually a third parameter at the moment: the -m model_id option, which can be used to specify multiple models.
Since different models need different prompting strategies (Claude 3 likes XML-style tags, for example), the model should probably be handled as part of the larger idea of parameterized prompt settings.
I like the idea of keeping the llm evals example.yml -m 3.5 -m 4t shortcut, because for simple evals being able to run against multiple models like that still makes sense, but it's going to be syntactic sugar for defining parameterized prompts.
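For example, that shortcut might expand to something like the following under the hood. This is only a sketch: the models: key and this exact expansion are assumptions, not a settled design.

# Hypothetical: -m 3.5 -m 4t expands into a models parameter list
models:
- "3.5"
- "4t"
system: >
  Return just a single word in the specified language
prompt: |
  Apple in Spanish
checks:
- iexact: manzana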
Idea: what if any of the items in the YAML that are singular could optionally be plural instead, and if they are plural they are treated as parameters?
This would benefit from a subtle redesign such that the prompt plus system prompt are bundled, so you can define an eval that tries two different predetermined combinations of those. But you could also have a single prompt and a list of system prompts to try against that single prompt.
That could be a bit confusing, though: having differently named YAML keys for the parameterized bits may be easier to read and understand.
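As a rough sketch of that plural-key idea (illustrative only: the systems: key and the second system prompt here are made up for the example), a singular prompt paired with a plural systems: list would expand into one run per system prompt:

ev: 0.1
name: Basic languages
# Hypothetical plural key: each entry becomes one parameterized run
systems:
- Return just a single word in the specified language
- Return only the translated word, with no punctuation or commentary
prompt: |
  Apple in Spanish
checks:
- iexact: manzana
- notcontains: apple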
Was playing a bit with possible YAML formats for parameterized testing... here's what I came up with:
ev: 0.1
name: Basic languages
system: >
  Return just a single word in the specified language
- prompt: |
    Apple in Spanish
  response: manzana
  checks:
  - iexact: manzana
  - notcontains: apple
- prompt: |
    Bread in French
  response: pain
  checks:
  - iexact: pain
  - notcontains: bread
or
ev: 0.1
name: Basic languages
system:
  description: "Return just a single word in the specified language."
prompts:
- name: "Apple in Spanish"
  response: "manzana"
  checks:
  - type: "iexact"
    value: "manzana"
  - type: "notcontains"
    value: "apple"
- name: "Bread in French"
  response: "pain"
  checks:
  - type: "iexact"
    value: "pain"
  - type: "notcontains"
    value: "bread"
@simonw Something to consider for your design is to have separate stages for before and after the responses from an LLM service, with those responses stored for later evaluation. I did this with CopyBlaster and found it worked well; having separate stages worked better than a simple single-stage approach like unit tests. I chose to store the responses in a file directory hierarchy, which works well when saved in a git repository.
The parameterization will become an important link between these stages. I'm not sure I recommend how I did parameterization, but it was very simple: argument value choices are represented by short string tags like "e0", "e1", "uk", "us", and so on, and different parameters may not reuse each other's tags. That way I can throw the chosen parameter arguments into a simple unstructured bag, such as a folder named "e1-uk", to record which arguments were chosen.
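If I'm reading that right, the scheme might look something like this. This is a sketch with hypothetical parameter names and values; only the tags and the e1-uk folder name come from the description above.

# Hypothetical parameters, each value identified by a short unique tag
parameters:
  engine:
    e0: older-model
    e1: newer-model
  locale:
    uk: British English
    us: American English
# Because no tag is reused across parameters, a run that chose e1 and uk
# can store its responses in a folder simply named "e1-uk"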
Let me know if there are any questions I can answer beyond the minimal documentation I've written.
Initial thoughts here:
Inspired by @pytest.mark.parametrize, but with a CLI interface to provide a source of parameters (CSV or JSON) along with the YAML file.
Related:
3
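To make that CSV-plus-YAML idea concrete, here is a minimal sketch. The $-style placeholders and the --params option are hypothetical, not an existing feature.

# example.yml - hypothetical $ placeholders filled in from the parameter source
ev: 0.1
name: Basic languages
system: >
  Return just a single word in the specified language
prompt: |
  $word in $language
checks:
- iexact: $expected

# params.csv - one eval run per row:
#   word,language,expected
#   Apple,Spanish,manzana
#   Bread,French,pain

# Hypothetical invocation: llm evals example.yml --params params.csv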