simonw opened this issue 2 months ago
Current hook spec: https://github.com/simonw/llm-evals-plugin/blob/e427453a1995fdc9d2270581f33851c256bc0c3a/llm_evals/hookspecs.py#L6-L8
Example usage:

```python
import llm


@llm.hookimpl
def register_eval_checks():
    return [Iexact]
```
I think there should probably be a `Check` class that all checks subclass. It could handle error reporting perhaps?
I'm also tempted to make checks use `assert` internally - something like this perhaps:

```python
assert self.text.lower() == response.text().lower(), "Lowercase match failed"
```
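A rough sketch of how those two ideas could fit together - the `check()` / `run()` method names and the return shape are just placeholders, not anything in the current spec:

```python
class Check:
    """Hypothetical base class: subclasses implement check() using plain asserts,
    and the base class turns assertion failures into an error report."""

    def check(self, response):
        raise NotImplementedError

    def run(self, response):
        try:
            self.check(response)
            return {"passed": True}
        except AssertionError as ex:
            return {"passed": False, "error": str(ex) or "Check failed"}


class Iexact(Check):
    "Case-insensitive exact match against the response text"

    def __init__(self, text):
        self.text = text

    def check(self, response):
        assert self.text.lower() == response.text().lower(), "Lowercase match failed"
```

That would keep each concrete check down to a single assert while the base class owns the error reporting.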
There could be a `llm-evals-unsafe-python` plugin which adds a check that looks like this:

```yaml
checks:
- unsafe_python: assert response.text() == "Unsafely checked here"
```
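The implementation of that check could be as simple as calling `exec()` with the response in scope - the class and method names here are placeholders:

```python
class UnsafePython:
    """Hypothetical check that executes arbitrary Python from the eval YAML"""

    def __init__(self, code):
        self.code = code

    def check(self, response):
        # The snippet can do anything at all, which is exactly why this
        # plugin would carry an "unsafe" label
        exec(self.code, {"response": response})
```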
(Is the "unsafe" designation really necessary here? I feel like it is, because I want to discourage people from running llm evals https://url-to-some-yaml-file
against unstrusted files that could execute arbitrary code if they are running unsafe plugins. But maybe having unsafe
in the name of the plugin is enough warning there?)
Another option is that unsafe plugins could mark themselves as such, and then you would have to run `llm evals ... --unsafe` to execute them.
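Something like this, maybe - a hypothetical `unsafe = True` class attribute plus a gate in the eval runner (none of this exists yet):

```python
def resolve_checks(check_classes, allow_unsafe=False):
    """Refuse to run checks whose class sets unsafe = True,
    unless the user passed --unsafe on the command line."""
    for check_class in check_classes:
        if getattr(check_class, "unsafe", False) and not allow_unsafe:
            raise RuntimeError(
                f"{check_class.__name__} is marked unsafe - "
                "re-run with --unsafe to enable it"
            )
        yield check_class
```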
I want a `llm-evals-quickjs` plugin which WILL be safe, because it will support checks like this one which run in a sandbox:

```yaml
checks:
- quickjs: |
    if (response.text() !== "expected") {
      throw new Error("not expected");
    }
```
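A rough sketch of how that could work, assuming Python bindings along the lines of the `quickjs` PyPI package (the class name and the `response` shim are placeholders):

```python
import json

import quickjs  # assumes the quickjs PyPI bindings


class QuickJS:
    """Hypothetical sandboxed check: runs a JavaScript snippet against the response"""

    def __init__(self, code):
        self.code = code

    def check(self, response):
        context = quickjs.Context()
        # Expose a minimal response shim to the script - only text() is
        # available, and it returns the already-captured response string
        context.eval(
            "var response = {text: function () { return %s; }};"
            % json.dumps(response.text())
        )
        # Any uncaught JS exception (e.g. throw new Error(...)) surfaces
        # as a Python exception, which counts as a failed check
        context.eval(self.code)
```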
So people can see how they can implement their own checks.
I'll also release the first checks plugin, for SQLite query execution, mainly as a demo.
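I haven't designed that one yet, but it could look something like this - the constructor arguments and the row comparison are entirely up for grabs:

```python
import sqlite3


class SQLiteQuery:
    """Hypothetical demo check: run the model's SQL and compare the returned rows"""

    def __init__(self, expected_rows, setup_sql=""):
        self.expected_rows = expected_rows
        self.setup_sql = setup_sql

    def check(self, response):
        db = sqlite3.connect(":memory:")
        if self.setup_sql:
            db.executescript(self.setup_sql)
        rows = db.execute(response.text()).fetchall()
        assert rows == self.expected_rows, f"Expected {self.expected_rows}, got {rows}"
```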