microsoft / promptflow

Build high-quality LLM apps - from prototyping, testing to production deployment and monitoring.
https://microsoft.github.io/promptflow/
MIT License

[BUG] Evaluate on test dataset using evaluate() with SimilarityEvaluator returns NaN #3381

Closed · bhonris closed this 2 months ago

bhonris commented 4 months ago

Describe the bug
When running an evaluation on a dataset with evaluate() and the similarity evaluator, I have come across some scenarios where the result is not a number (NaN).

How to reproduce the bug
Model config:

{azure_deployment = "gpt4-turbo-preview", api_version = "2024-02-01"}

jsonl file:

{"Question":"How can you get the version of the Kubernetes cluster?","Answer":"{\"code\": \"kubectl version\" }","output":"{code: kubectl version --output=json}"}

Evaluate config:

# Imports assume the promptflow-evals package, which provides evaluate() and the built-in evaluators.
from promptflow.evals.evaluate import evaluate
from promptflow.evals.evaluators import SimilarityEvaluator

result = evaluate(
    data="testdata2.jsonl",
    evaluators={
        "similarity": SimilarityEvaluator(model_config)
    },
    evaluator_config={
        # Map dataset columns to the evaluator's expected inputs.
        "default": {
            "question": "${data.Question}",
            "answer": "${data.output}",
            "ground_truth": "${data.Answer}"
        }
    }
)
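For context, a minimal sketch of how the model_config used above could be constructed; the endpoint and key are placeholders, not values from the original report:

from promptflow.core import AzureOpenAIModelConfiguration

# Placeholder endpoint/key; the deployment and API version match the report above.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    azure_deployment="gpt4-turbo-preview",
    api_version="2024-02-01",
)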

Expected behavior
The value returned is a number (the similarity score).


bhonris commented 4 months ago

I added the following text to similarity.prompty: "You will respond with a single digit number between 1 and 5. You will include no other text or information", and this seems to fix the issue.
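This workaround points at the likely failure mode: the evaluator expects the completion to be a bare numeric score, so any surrounding text makes the parse fail and the score comes back as NaN. As an illustration only, a hedged sketch of a more tolerant parse (a hypothetical helper, not promptflow's actual implementation):

import math
import re

def parse_similarity_score(completion: str) -> float:
    """Extract a 1-5 score from a model completion, tolerating extra text."""
    match = re.search(r"\b([1-5])\b", completion)
    if match:
        return float(match.group(1))
    return math.nan  # mirrors the observed behavior for unparseable output

# A chatty preview-model reply still yields a usable score:
print(parse_similarity_score("The similarity score is 4 out of 5."))  # 4.0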

brynn-code commented 4 months ago

Hi @singankit and @luigiw, could you please help take a look at this issue?

luigiw commented 4 months ago

@bhonris, thank you for reporting the issue and sharing a workaround. It is a known issue that some preview OpenAI models can produce NaN results. Please also try with stable model versions.
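Following that suggestion, the only change needed is the deployment name in the model configuration; the stable deployment name below is an assumption, so substitute whatever deployment your Azure OpenAI resource exposes:

from promptflow.core import AzureOpenAIModelConfiguration

# Assumed stable GA deployment instead of "gpt4-turbo-preview".
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    azure_deployment="gpt-4",
    api_version="2024-02-01",
)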

github-actions[bot] commented 3 months ago

Hi, we're sending this friendly reminder because we haven't heard back from you in 30 days. We need more information about this issue to help address it. Please be sure to give us your input. If we don't hear back from you within 7 days of this comment, the issue will be automatically closed. Thank you!

luigiw commented 2 months ago

Fixed in version 0.3.2.
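Assuming the fix shipped in the promptflow-evals package (where evaluate() and SimilarityEvaluator live), upgrading with `pip install -U promptflow-evals` should pick it up.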