preset-io / promptimize

Promptimize is a prompt engineering evaluation and testing toolkit.
Apache License 2.0

Evaluation function based on LLM grading #7

Open msaelices opened 1 year ago

msaelices commented 1 year ago

Changes

Proof-of-life

[screenshot attachment: 2023-05-24_12-35]

edwardmfho commented 1 year ago

Test-ran the example cases with gpt-3.5-turbo instead of the default gpt-4 (still waiting for API access). It would be a good function to add.

mistercrunch commented 1 year ago

I like the idea; my only concern is that the implementation is very OpenAI-specific, which is probably fine as long as we make that clear. How about we break evals.py down into evals/__init__.py and evals/openai.py?

The goal would be to import is_correct from promptimize.evals.openai.
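To make the discussion concrete, here is a minimal sketch of what an LLM-grading is_correct could look like. This is illustrative, not the PR's actual implementation: the complete callable, the grading prompt wording, and the stub judge are all assumptions, with the provider call injected as a parameter so the function itself stays provider-agnostic.

```python
from typing import Callable

def is_correct(
    response: str,
    expectation: str,
    complete: Callable[[str], str],
) -> float:
    """Grade `response` against `expectation` by asking an LLM judge.

    `complete` is any prompt -> completion callable (e.g. a thin wrapper
    around an OpenAI chat call), injected so other providers can plug in.
    """
    prompt = (
        "You are a strict grader. Answer only YES or NO.\n"
        "Does the response satisfy the expectation?\n"
        f"Expectation: {expectation}\n"
        f"Response: {response}\n"
    )
    verdict = complete(prompt).strip().upper()
    # Map the judge's YES/NO verdict onto a 0-to-1 score
    return 1.0 if verdict.startswith("YES") else 0.0

# Stub judge standing in for a real OpenAI completion call
fake_judge = lambda prompt: "YES" if "Paris" in prompt else "NO"
print(is_correct("Paris", "Names the capital of France", fake_judge))  # 1.0
```

Injecting the completion function keeps the OpenAI dependency confined to evals/openai.py, which would simply bind complete to an OpenAI-backed wrapper.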

edwardmfho commented 1 year ago

Should we begin with some sort of base eval function/class that could be used for other LLMs?
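A base class along those lines could be sketched as below. All names here (BaseLLMGrader, complete, grade, the toy subclass) are hypothetical, intended only to show the shape: the base class owns the grading logic, and each provider subclass supplies only the completion call.

```python
from abc import ABC, abstractmethod

class BaseLLMGrader(ABC):
    """Provider-agnostic LLM grading: subclasses only supply `complete`."""

    GRADING_TEMPLATE = (
        "Answer YES or NO only. Does the response meet the expectation?\n"
        "Expectation: {expectation}\nResponse: {response}\n"
    )

    @abstractmethod
    def complete(self, prompt: str) -> str:
        """Send `prompt` to the underlying LLM and return its completion."""

    def grade(self, response: str, expectation: str) -> float:
        prompt = self.GRADING_TEMPLATE.format(
            expectation=expectation, response=response
        )
        verdict = self.complete(prompt).strip().upper()
        # Shared scoring logic lives here, independent of the provider
        return 1.0 if verdict.startswith("YES") else 0.0

class EchoGrader(BaseLLMGrader):
    """Toy subclass standing in for an OpenAI- or Anthropic-backed grader."""
    def complete(self, prompt: str) -> str:
        return "YES" if "42" in prompt else "NO"

print(EchoGrader().grade("42", "Gives the answer to everything"))  # 1.0
```

An OpenAIGrader subclass would then live in evals/openai.py and implement complete with the actual API call, keeping the vendor-specific code in one module.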
