hegelai / prompttools

Open-source tools for prompt testing and experimentation, with support for both LLMs (e.g. OpenAI, LLaMA) and vector databases (e.g. Chroma, Weaviate, LanceDB).
http://prompttools.readthedocs.io
Apache License 2.0

experiment.evaluate() shows stale evaluation results #79

Open Ā· davidtan-tw opened this issue 10 months ago

davidtan-tw commented 10 months ago

šŸ› Describe the bug

Hi folks,

Thanks again for your work on this library.

I noticed that similarity scores do not get updated when I change the expected values passed to evaluate(). The scores are only refreshed when I re-run the whole experiment.

Bug

(Screenshot omitted: the similarity score column stays unchanged after the second evaluate() call.)

Steps to reproduce:

from prompttools.experiment import OpenAIChatExperiment
from prompttools.utils import semantic_similarity

models = ["gpt-3.5-turbo", "gpt-3.5-turbo-0613"]
messages = [
    [
        {"role": "system", "content": "Who is the first president of the US? Give me only the name"},
    ]
]
temperatures = [0.0]

# Run the experiment once and inspect the raw responses
experiment = OpenAIChatExperiment(models, messages, temperature=temperatures)
experiment.run()
experiment.visualize()

# First evaluation: scores are computed against "George Washington"
experiment.evaluate("similar_to_expected", semantic_similarity, expected=["George Washington"] * 2)
experiment.visualize()

# Second evaluation with a different expected value: the column is NOT recomputed
experiment.evaluate("similar_to_expected", semantic_similarity, expected=["Lady Gaga"] * 2)
experiment.visualize()  # the scores shown here still suggest that "Lady Gaga" is semantically identical to "George Washington"

In my opinion, evaluate() should re-compute metrics every time it is called, rather than being coupled to run(). I haven't tested other eval_fns, but it may be worth checking whether they behave the same way.
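For context, the only workaround I have found so far is to re-run the whole experiment before evaluating again. A minimal sketch reusing the objects from the snippet above (the downside is that run() makes a fresh round of API calls just to refresh a metric):

experiment.run()  # re-query the models; after this, evaluate() recomputes the scores (as observed above)
experiment.evaluate("similar_to_expected", semantic_similarity, expected=["Lady Gaga"] * 2)
experiment.visualize()  # the scores now reflect the new expected values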

NivekT commented 10 months ago

Your observation is correct. Currently, if a metric with the same name already exists ("similar_to_expected" in your case), evaluate() skips the computation and emits a warning (as seen in your notebook: "WARNING: similar_to_expected is already present, skipping") rather than overwriting the existing column.

If you change the metric name given in the second .evaluate call (e.g. experiment.evaluate("similar_to_expected_2", ...)), it will compute the scores into another column.
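For example, a minimal sketch reusing the snippet from the original report ("similar_to_expected_2" is just an arbitrary new column name):

experiment.evaluate("similar_to_expected_2", semantic_similarity, expected=["Lady Gaga"] * 2)
experiment.visualize()  # "similar_to_expected" keeps the old scores; "similar_to_expected_2" holds the recomputed ones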

We are open to having evaluate() overwrite the column when the metric name already exists. Let us know what you think.

Sruthi5797 commented 9 months ago

Thank you for this issue. I changed the variable name, but the response column is still stale. Any leads on this? I am using Python 3.11.5.

NivekT commented 9 months ago

Hi @Sruthi5797,

Can you post a minimal code snippet of what you are running? Also, are you seeing any warning message?