simonw / llm-evals-plugin

Run evals using LLM

Design and implement parameterization mechanism #4

Open simonw opened 5 months ago

simonw commented 5 months ago

Initial thoughts here:

I want a parameterization mechanism, so you can run the same eval against multiple examples at once. Those examples can be stored directly in the YAML or can be referenced as the filename or URL to a CSV or JSON file.
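
A rough sketch of what that could look like (the examples key, the $question substitution, and the file reference are all speculative syntax, not anything implemented):

ev: 0.1
name: SQL questions
prompt: 'Write a SQL query that answers this question: $question'
# examples stored directly in the YAML:
examples:
  - question: count all the rows in the users table
  - question: find the ten most recent posts
# ...or referenced as the filename or URL of a CSV or JSON file:
# examples: questions.csv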

I realize that, just like with pytest, I'd like to be able to apply multiple parameter groups at once - so I could, for example, define a set of 100 SQL query questions and assertions, also provide 20 possible system prompts, and run the whole matrix of 2,000 responses to see which system prompt scores highest.
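
A purely illustrative sketch of that matrix (key names invented):

ev: 0.1
name: SQL prompt matrix
# group 1: 100 questions plus assertions, loaded from a file
examples: sql_questions.csv
# group 2: 20 candidate system prompts
systems:
  - You are a careful SQL expert who writes standard ANSI SQL.
  - Reply with a single SQL query and nothing else.
  # ...18 more...
# every combination runs: 100 examples x 20 systems = 2,000 responses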

Inspired by @pytest.mark.parametrize, but with a CLI interface for providing a source of parameters (CSV or JSON) along with the YAML file.

Related:

simonw commented 5 months ago

Two ways this could work:

  1. Unlimited parameter groups, as seen in pytest
  2. You can parameterize the prompt (plus system prompt and options) and you can parameterize the examples, but only those two groups

I'm leaning towards that second option at the moment, unless I can come up with a really simple design for the first option.
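
A sketch of what that second option might look like, with exactly two fixed parameter groups (all key names hypothetical):

ev: 0.1
name: Two fixed parameter groups
# group 1: prompt bundles (prompt + system prompt + options)
prompts:
  - prompt: 'Translate to French: $text'
    system: Reply with the translation only.
    options:
      temperature: 0.2
# group 2: the examples each bundle runs against
examples:
  - text: hello
    checks:
      - iexact: bonjour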

simonw commented 5 months ago

There's actually a third parameter group at the moment: the -m model_id option, which can be used to specify multiple models.

Since different models need different prompting strategies - Claude 3 likes XML-style tags, for example - the model should probably be handled as part of the larger idea of parameterized prompt settings.
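
Folding the model into each prompt-settings entry could look something like this (speculative):

prompts:
  - model: gpt-4-turbo
    system: Answer concisely.
  - model: claude-3-opus
    system: |
      Answer concisely.
      Wrap your answer in <answer></answer> tags.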

simonw commented 5 months ago

I like the idea of keeping the llm evals example.yml -m 3.5 -m 4t shortcut, because for simple evals being able to run against multiple models like that still makes sense - but it would be syntactic sugar for defining parameterized prompts.
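
Under that design, -m 3.5 -m 4t might simply expand to something like this (invented expansion), with the eval's existing prompt and system prompt applied to each entry:

prompts:
  - model: gpt-3.5-turbo
  - model: gpt-4-turbo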

simonw commented 5 months ago

Idea: what if any item in the YAML that is singular could optionally be plural instead - and if it is plural, it is treated as a parameter?

This would benefit from a subtle redesign such that the prompt plus system prompt are bundled, so you can define an eval that tries two different predetermined combinations of those - but you could also have a single prompt and a list of system prompts to try against it.
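
So a single prompt fanned out across several system prompts might read (speculative syntax):

prompt: 'Translate to French: $text'
systems:
  - Reply with the translation only.
  - You are a terse professional translator.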

simonw commented 5 months ago

That could be a bit confusing - having differently named YAML keys for the parameterized bits may be easier to read and understand.
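
For comparison, an explicitly named key for the parameterized part might look like this (again invented), making it obvious at a glance which values fan out:

prompt: 'Translate to French: $text'
system_variants:
  - Reply with the translation only.
  - You are a terse professional translator.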

bradAGI commented 5 months ago

Was playing a bit with possible YAML formats for parameterized testing... Here's what I came up with...

ev: 0.1
name: Basic languages
system: >
  Return just a single word in the specified language
prompts:
  - prompt: Apple in Spanish
    response: manzana
    checks:
      - iexact: manzana
      - notcontains: apple
  - prompt: Bread in French
    response: pain
    checks:
      - iexact: pain
      - notcontains: bread

or

ev: 0.1
name: Basic languages
system:
  description: "Return just a single word in the specified language."
  prompts:
    - name: "Apple in Spanish"
      response: "manzana"
      checks:
        - type: "iexact"
          value: "manzana"
        - type: "notcontains"
          value: "apple"
    - name: "Bread in French"
      response: "pain"
      checks:
        - type: "iexact"
          value: "pain"
        - type: "notcontains"
          value: "bread"

castedo commented 4 months ago

@simonw Something to consider for your design is having separate stages for before and after the responses come back from an LLM service, with those responses stored for later evaluation. I did this with CopyBlaster and found it worked well - better than a simple single-stage approach like unit tests. I chose to store the responses in a file directory hierarchy, which works well when saved in a git repository.

The parameterization becomes an important link between these stages. I'm not sure I recommend how I did parameterization, but it was very simple: each argument value choice is represented by a short string (tag) like "e0", "e1", "uk", "us", etc., and different parameters may NOT reuse each other's tags. That way I can throw the chosen tags into a simple unstructured bag, like the folder name "e1-uk", to record which parameter arguments were chosen.
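
For example, a stored-responses tree under that scheme might look like this (illustrative layout, not necessarily CopyBlaster's):

responses/
  e0-uk/
  e0-us/
  e1-uk/
  e1-us/

Because no two parameters share tags, a folder name like "e1-uk" unambiguously records one choice per parameter, regardless of order.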

Let me know if there are any questions I can answer beyond the minimal documentation I've written.