patterns-ai-core / langchainrb

Build LLM-powered applications in Ruby
https://rubydoc.info/gems/langchainrb
MIT License

Improvements to Evals #851

Open bborn opened 1 month ago

bborn commented 1 month ago

Just exploring some ideas:

andreibondarev commented 1 month ago

@bborn How useful is the regex evaluator in your opinion?

Do you have any thoughts on evals that calculate vector distance?

bborn commented 1 month ago

@andreibondarev I think regex or other string comparison is pretty important. You might want to ensure your model/agent is returning a number, a URL, an email, etc. Or you might want to ensure that the answer contains the expected output string (this doesn't exist in Langchain.rb yet), something like:

regex_evaluator.score(answer: "The answer to 2 + 2 is 4", expected_answer: "4")
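
A code-graded check like that could be only a few lines. Rough sketch (the class name and score signature below are just illustrative, mirroring the call above; nothing like this is in the gem yet):

# Illustrative code-graded evaluator: pass a Regexp (e.g. /\d+/ to require a
# number) or fall back to checking that the expected answer appears as a
# substring. Class name and signature are hypothetical.
class StringMatchEval
  def initialize(pattern: nil)
    @pattern = pattern
  end

  # Returns 1 if the answer matches, 0 otherwise.
  def score(answer:, expected_answer: nil)
    target = @pattern || expected_answer.to_s
    matched = target.is_a?(Regexp) ? answer.match?(target) : answer.include?(target)
    matched ? 1 : 0
  end
end

StringMatchEval.new.score(answer: "The answer to 2 + 2 is 4", expected_answer: "4") # => 1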

Vector (or Levenshtein, etc.) distance seems useful too, not so much as an absolute score but as something you could track over time (if our agent was getting a vector score of 0.75 for the last three months, and then we changed the prompt and now it's getting 0.45, we'd be concerned).
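
A rough sketch of what a cosine-similarity evaluator could look like, assuming the llm client responds to embed(text:) and the response exposes #embedding (as the OpenAI wrapper does); the class itself is hypothetical:

# Hypothetical cosine-similarity evaluator built on the llm client's
# embedding support. Returns a similarity in [-1, 1]; more useful tracked
# over time than as an absolute pass/fail score.
class CosineSimilarityEval
  def initialize(llm:)
    @llm = llm
  end

  def score(answer:, expected_answer:)
    a = @llm.embed(text: answer).embedding
    b = @llm.embed(text: expected_answer).embedding
    dot = a.zip(b).sum { |x, y| x * y }
    dot / (Math.sqrt(a.sum { |x| x**2 }) * Math.sqrt(b.sum { |x| x**2 }))
  end
end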

I think the evaluators kind of break down into LLM Graded, LLM Labeled, and Code Graded:


- LLM Graded: ask another LLM to score the dataset item based on some criteria
- LLM Labeled: ask an LLM to label the dataset item
- Code Graded: run the dataset item through some grading algorithm (regex, JSON, or other)
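
For the LLM Graded case, something like this could work (hypothetical class; assumes the llm client's chat(messages:) / chat_completion interface):

# Illustrative LLM-graded evaluator: ask a grader model to score an answer
# against some criteria and parse the reply as a 0..1 score.
class LlmGradedEval
  def initialize(llm:, criteria:)
    @llm = llm
    @criteria = criteria
  end

  def score(question:, answer:)
    prompt = <<~PROMPT
      Grade the answer against the criteria. Reply with a single number between 0 and 1.
      Criteria: #{@criteria}
      Question: #{question}
      Answer: #{answer}
    PROMPT

    @llm.chat(messages: [{ role: "user", content: prompt }]).chat_completion.to_f
  end
end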

bborn commented 1 month ago

Another thought: maybe you should be able to add an Eval to your Agent or llm call like this:

dataset = "/path/to/dataset.jsonl"

evaluators = [
    Langchain::Evals::LLM::LLM.new(llm: llm),
    Langchain::Evals::LLM::CosineSimilarity.new(llm: llm)
]

eval_service = EvalService.new(evaluators, dataset)

response = llm.complete(prompt: "Once upon a time", evaluate_with: eval_service)

By default this would store eval results in a CSV (could be anything: SQLite, whatever) in the same location as the dataset.
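
Roughly what I'm imagining for the service itself (all of these names are part of the proposal, not the current gem; the evaluator question:/answer: interface is an assumption):

require "csv"
require "time"

class EvalService
  def initialize(evaluators, dataset, options = {})
    @evaluators = evaluators
    @dataset = dataset
    @options = options
    # eval results live next to the dataset, e.g. /path/to/dataset_evals.csv
    @results_path = dataset.sub(/\.jsonl\z/, "_evals.csv")
  end

  # Run every evaluator against a completion and append the scores.
  def evaluate(question:, answer:)
    scores = @evaluators.map { |evaluator| evaluator.score(question: question, answer: answer) }
    CSV.open(@results_path, "a") { |csv| csv << [Time.now.utc.iso8601, question, answer, *scores] }
    scores
  end
end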

Another idea would be the ability to log the completion results to the dataset before evaluating them (e.g. if you don't already have a dataset):

dataset = "/path/to/dataset.jsonl"

evaluators = [
    Langchain::Evals::LLM::LLM.new(llm: llm),
    Langchain::Evals::LLM::CosineSimilarity.new(llm: llm)
]

options = {
    log_completions: true,
    log_rate: 0.5    # log 50% of completions
}

eval_service = EvalService.new(evaluators, dataset, options)

response = llm.complete(prompt: "Once upon a time", evaluate_with: eval_service)
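
The logging piece could be as simple as sampling and appending to the JSONL file before evaluation (hypothetical helper, just to show the intent of log_completions / log_rate):

require "json"

# Append a sampled fraction of completions to the JSONL dataset.
def maybe_log_completion(dataset_path, prompt, completion, log_rate)
  return unless rand < log_rate  # keep roughly log_rate of completions

  File.open(dataset_path, "a") do |file|
    file.puts({ prompt: prompt, completion: completion }.to_json)
  end
end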