bborn opened this issue 1 month ago
@bborn How useful is the regex evaluator in your opinion?
Do you have any thoughts on evals that calculate vector distance?
@andreibondarev I think regex or other string comparison is pretty important. You might want to ensure your model/agent is returning a number, a URL, an email, etc. Or you might want to ensure that the answer contains the expected output string (this doesn't exist in Langchain.rb yet), something like:
regex_evaluator.score(answer: "The answer to 2 + 2 is 4", expected_answer: "4")
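Something in that direction could be a plain code-graded check. The sketch below is hypothetical (the `RegexMatch` class and its namespace are made up, not existing Langchain.rb API), just to show the shape of a contains/matches evaluator:

```ruby
# Hypothetical code-graded evaluator (not existing Langchain.rb API):
# scores 1.0 when the expected string/pattern appears in the answer, 0.0 otherwise.
module Langchain
  module Evals
    module Code
      class RegexMatch
        def score(answer:, expected_answer:)
          pattern = expected_answer.is_a?(Regexp) ? expected_answer : Regexp.new(Regexp.escape(expected_answer))
          answer.match?(pattern) ? 1.0 : 0.0
        end
      end
    end
  end
end

regex_evaluator = Langchain::Evals::Code::RegexMatch.new
regex_evaluator.score(answer: "The answer to 2 + 2 is 4", expected_answer: "4") # => 1.0
```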
Vector (or Levenshtein, etc.) distance seems useful too. Not so much as an absolute score, but as something you could watch over time: if our agent had been getting a vector score of 0.75 for the last three months, and then we changed the prompt and now it's getting 0.45, we'd be concerned.
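For the distance-style scores, the similarity math itself is easy to sketch without committing to a particular embedding client. Assuming you already have two embedding vectors as plain arrays of floats, cosine similarity is just:

```ruby
# Cosine similarity between two embedding vectors, plain Ruby, no gems.
# Obtaining the vectors is up to whatever embedding model/client you use.
def cosine_similarity(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero?
  dot / (norm_a * norm_b)
end

# The interesting signal is the trend of this number per prompt/agent version,
# not any single absolute value.
```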
I think the evaluators kind of break down into three categories (a rough sketch of a shared interface is below):

- LLM Graded: ask another LLM to score the dataset item based on some criteria
- LLM Labeled: ask an LLM to label the dataset item
- Code Graded: run the dataset item through some grading algorithm (regex, JSON, or other)
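Here's a rough sketch of how the three categories could share one interface. All class names here are hypothetical, and the `llm.complete(prompt:)` call just mirrors the usage further down in this comment:

```ruby
class CodeGraded
  # Deterministic: run the item through a grading algorithm (here, substring containment).
  def score(answer:, expected_answer:)
    answer.include?(expected_answer) ? 1.0 : 0.0
  end
end

class LLMGraded
  def initialize(llm:)
    @llm = llm
  end

  # Ask another LLM to rate the answer between 0 and 1 against some criteria.
  # Assumes the completion can be coerced to a string containing a number.
  def score(answer:, expected_answer:)
    prompt = "On a scale of 0 to 1, how well does #{answer.inspect} match #{expected_answer.inspect}? Reply with just the number."
    @llm.complete(prompt: prompt).to_s.to_f.clamp(0.0, 1.0)
  end
end

class LLMLabeled
  def initialize(llm:)
    @llm = llm
  end

  # Ask an LLM to assign a label rather than a numeric score.
  def label(answer:, expected_answer:)
    prompt = "Label #{answer.inspect} as correct or incorrect given the expected answer #{expected_answer.inspect}. Reply with one word."
    @llm.complete(prompt: prompt).to_s.strip.downcase
  end
end
```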
Another thought: maybe you should be able to add an Eval to your Agent or llm call like this:
dataset = "/path/to/dataset.jsonl"
evaluators = [
  Langchain::Evals::LLM::LLM.new(llm: llm),
  Langchain::Evals::LLM::CosineSimilarity.new(llm: llm)
]
eval_service = EvalService.new(evaluators, dataset)
response = llm.complete(prompt: "Once upon a time", evaluate_with: eval_service)
By default this would store eval results in a CSV (could be anything: SQLite, whatever) in the same location as the dataset, something like the sketch below.
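EvalService doesn't exist anywhere yet; a minimal version of that store-to-CSV flow might look roughly like this (using Ruby's csv stdlib and writing a results file next to the dataset; everything here is a sketch, not a proposed final API):

```ruby
require "csv"
require "time"

# Hypothetical EvalService: runs every evaluator over a completion and appends
# the scores to a CSV sitting next to the dataset (e.g. dataset_results.csv).
class EvalService
  def initialize(evaluators, dataset_path, options = {})
    @evaluators   = evaluators
    @dataset_path = dataset_path
    @options      = options
    @results_path = dataset_path.sub(/\.\w+\z/, "_results.csv")
  end

  def evaluate(prompt:, completion:, expected_answer: nil)
    CSV.open(@results_path, "a") do |csv|
      @evaluators.each do |evaluator|
        score = evaluator.score(answer: completion, expected_answer: expected_answer)
        csv << [Time.now.utc.iso8601, evaluator.class.name, prompt, completion, score]
      end
    end
  end
end
```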
Another idea would be the ability to log the completion results to the dataset before evaluating them (e.g. if you don't already have a dataset):
dataset = "/path/to/dataset.jsonl"
evaluators = [
  Langchain::Evals::LLM::LLM.new(llm: llm),
  Langchain::Evals::LLM::CosineSimilarity.new(llm: llm)
]
options = {
  log_completions: true,
  log_rate: 0.5 # log 50% of completions
}
eval_service = EvalService.new(evaluators, dataset, options)
response = llm.complete(prompt: "Once upon a time", evaluate_with: eval_service)
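The log_completions / log_rate part could be as simple as sampling and appending JSON lines to the dataset file. A hypothetical sketch (again, not real Langchain.rb API):

```ruby
require "json"

# Hypothetical sketch of log_completions / log_rate: sample a fraction of
# completions and append them to the JSONL dataset before evaluating.
def maybe_log_completion(dataset_path, prompt, completion, log_rate: 0.5)
  return unless rand < log_rate # log roughly log_rate * 100% of completions

  File.open(dataset_path, "a") do |f|
    f.puts({ prompt: prompt, completion: completion, logged_at: Time.now.utc }.to_json)
  end
end

maybe_log_completion("/path/to/dataset.jsonl", "Once upon a time", response.to_s)
```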
Just exploring some ideas here.