patterns-ai-core / langchainrb

Build LLM-powered applications in Ruby
https://rubydoc.info/gems/langchainrb
MIT License

Improvements to Evals #851

Open bborn opened 1 month ago

bborn commented 1 month ago

Just exploring some ideas:

andreibondarev commented 1 month ago

@bborn How useful is the regex evaluator in your opinion?

Do you have any thoughts on evals that calculate vector distance?

bborn commented 1 month ago

@andreibondarev I think regex or other string comparison is pretty important. You might want to ensure your model/agent is returning a number, a URL, an email, etc. Or you might want to ensure that the answer contains the expected output string (this doesn't exist in Langchain.rb yet), something like:

regex_evaluator.score(answer: "The answer to 2 + 2 is 4", expected_answer: "4")
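
A code-graded check like that could be only a few lines. Rough sketch (the class name and score signature below are just illustrative, mirroring the call above; nothing like this is in the gem yet):

# Illustrative code-graded evaluator: pass a Regexp (e.g. /\d+/ to require a
# number) or fall back to checking that the expected answer appears as a
# substring. Class name and signature are hypothetical.
class StringMatchEval
  def initialize(pattern: nil)
    @pattern = pattern
  end

  # Returns 1 if the answer matches, 0 otherwise.
  def score(answer:, expected_answer: nil)
    target = @pattern || expected_answer.to_s
    matched = target.is_a?(Regexp) ? answer.match?(target) : answer.include?(target)
    matched ? 1 : 0
  end
end

StringMatchEval.new.score(answer: "The answer to 2 + 2 is 4", expected_answer: "4") # => 1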

Vector (or Levenshtein, etc.) distance seems useful too, not so much as an absolute score but as something you could track over time (if our agent was getting a vector score of 0.75 for the last three months, and then we changed the prompt and now it's getting 0.45, we'd be concerned).
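
A rough sketch of what a cosine-similarity evaluator could look like, assuming the llm client responds to embed(text:) and the response exposes #embedding (as the OpenAI wrapper does); the class itself is hypothetical:

# Hypothetical cosine-similarity evaluator built on the llm client's
# embedding support. Returns a similarity in [-1, 1]; more useful tracked
# over time than as an absolute pass/fail score.
class CosineSimilarityEval
  def initialize(llm:)
    @llm = llm
  end

  def score(answer:, expected_answer:)
    a = @llm.embed(text: answer).embedding
    b = @llm.embed(text: expected_answer).embedding
    dot = a.zip(b).sum { |x, y| x * y }
    dot / (Math.sqrt(a.sum { |x| x**2 }) * Math.sqrt(b.sum { |x| x**2 }))
  end
end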

I think the evaluators kind of break down into LLM Graded, LLM Labeled, and Code Graded:


- LLM Graded: ask another LLM to score the dataset item based on some criteria
- LLM Labeled: ask an LLM to label the dataset item
- Code Graded: run the dataset item through some grading algorithm (regex, JSON, or other)
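
For the LLM Graded case, something like this could work (hypothetical class; assumes the llm client's chat(messages:) / chat_completion interface):

# Illustrative LLM-graded evaluator: ask a grader model to score an answer
# against some criteria and parse the reply as a 0..1 score.
class LlmGradedEval
  def initialize(llm:, criteria:)
    @llm = llm
    @criteria = criteria
  end

  def score(question:, answer:)
    prompt = <<~PROMPT
      Grade the answer against the criteria. Reply with a single number between 0 and 1.
      Criteria: #{@criteria}
      Question: #{question}
      Answer: #{answer}
    PROMPT

    @llm.chat(messages: [{ role: "user", content: prompt }]).chat_completion.to_f
  end
end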

bborn commented 1 month ago

Another thought: maybe you should be able to add an Eval to your Agent or llm call like this:

dataset = "/path/to/dataset.jsonl"

evaluators = [
    Langchain::Evals::LLM::LLM.new(llm: llm),
    Langchain::Evals::LLM::CosineSimilarity.new(llm: llm)
]

eval_service = EvalService.new(evaluators, dataset)

response = llm.complete(prompt: "Once upon a time", evaluate_with: eval_service)

By default this would store eval results in a CSV (could be anything: SQLite, whatever) in the same location as the dataset.
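
Roughly what I'm imagining for the service itself (all of these names are part of the proposal, not the current gem; the evaluator question:/answer: interface is an assumption):

require "csv"
require "time"

class EvalService
  def initialize(evaluators, dataset, options = {})
    @evaluators = evaluators
    @dataset = dataset
    @options = options
    # eval results live next to the dataset, e.g. /path/to/dataset_evals.csv
    @results_path = dataset.sub(/\.jsonl\z/, "_evals.csv")
  end

  # Run every evaluator against a completion and append the scores.
  def evaluate(question:, answer:)
    scores = @evaluators.map { |evaluator| evaluator.score(question: question, answer: answer) }
    CSV.open(@results_path, "a") { |csv| csv << [Time.now.utc.iso8601, question, answer, *scores] }
    scores
  end
end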

Another idea would be the ability to log the completion results to the dataset before evaluating them (e.g. if you don't already have a dataset):

dataset = "/path/to/dataset.jsonl"

evaluators = [
    Langchain::Evals::LLM::LLM.new(llm: llm),
    Langchain::Evals::LLM::CosineSimilarity.new(llm: llm)
]

options = {
    log_completions: true,
    log_rate: 0.5    # log 50% of completions
}

eval_service = EvalService.new(evaluators, dataset, options)

response = llm.complete(prompt: "Once upon a time", evaluate_with: eval_service)
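
The logging piece could be as simple as sampling and appending to the JSONL file before evaluation (hypothetical helper, just to show the intent of log_completions / log_rate):

require "json"

# Append a sampled fraction of completions to the JSONL dataset.
def maybe_log_completion(dataset_path, prompt, completion, log_rate)
  return unless rand < log_rate  # keep roughly log_rate of completions

  File.open(dataset_path, "a") do |file|
    file.puts({ prompt: prompt, completion: completion }.to_json)
  end
end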