run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

RagEvaluatorPack instance throws: ValueError: could not convert string to float '' [Bug]: #11780

Closed vecorro closed 6 months ago

vecorro commented 6 months ago

Bug Description

While running the following code block:

from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.packs.rag_evaluator import RagEvaluatorPack

rag_eval_dataset = LabelledRagDataset.from_json(
    path=EVAL_DS_PATH
)

rag_evaluator = RagEvaluatorPack(
    judge_llm=judge_llm,
    query_engine=query_engine,  # built with the same source Documents as the rag_dataset
    rag_dataset=rag_eval_dataset,
    show_progress=True,
)

benchmark_df = await rag_evaluator.run()

I get ValueError: could not convert string to float: '' after more than 440 Q/A pairs (10 samples per batch) had been processed:

Version

v0.10.17

Steps to Reproduce

Here is the list of packages I'm using:

llama-index                              0.10.17
llama-index-agent-openai                 0.1.5
llama-index-cli                          0.1.8
llama-index-core                         0.10.17
llama-index-embeddings-huggingface       0.1.4
llama-index-embeddings-openai            0.1.6
llama-index-indices-managed-llama-cloud  0.1.3
llama-index-legacy                       0.9.48
llama-index-llms-azure-openai            0.1.5
llama-index-llms-litellm                 0.1.3
llama-index-llms-openai                  0.1.7
llama-index-llms-openai-like             0.1.3
llama-index-multi-modal-llms-openai      0.1.4
llama-index-packs-rag-evaluator          0.1.4
llama-index-postprocessor-cohere-rerank  0.1.2
llama-index-program-openai               0.1.4
llama-index-question-gen-openai          0.1.3
llama-index-readers-file                 0.1.8
llama-index-readers-llama-parse          0.1.3
llama-index-vector-stores-chroma         0.1.5
llama-index-vector-stores-postgres       0.1.2
llamaindex-py-client                     0.1.13

The error stack is the following:


File ~/miniconda3/envs/llm-env4/lib/python3.10/site-packages/llama_index/core/evaluation/correctness.py:146, in CorrectnessEvaluator.aevaluate(***failed resolving arguments***)
    138 eval_response = await self._llm.apredict(
    139     prompt=self._eval_template,
    140     query=query,
    141     generated_answer=response,
    142     reference_answer=reference or "(NO REFERENCE ANSWER SUPPLIED)",
    143 )
    145 # Use the parser function
--> 146 score, reasoning = self.parser_function(eval_response)
    148 return EvaluationResult(
    149     query=query,
    150     response=response,
   (...)
    153     feedback=reasoning,
    154 )

File ~/miniconda3/envs/llm-env4/lib/python3.10/site-packages/llama_index/core/evaluation/eval_utils.py:183, in default_parser(eval_response)
    173 """
    174 Default parser function for evaluation response.
    175 
   (...)
    180     Tuple[float, str]: A tuple containing the score as a float and the reasoning as a string.
    181 """
    182 score_str, reasoning_str = eval_response.split("\n", 1)
--> 183 score = float(score_str)
    184 reasoning = reasoning_str.lstrip("\n")
    185 return score, reasoning
ValueError: could not convert string to float: ''

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 6 months ago

From the error stack you've provided, it seems like the error is occurring in the default_parser function in eval_utils.py. This function is trying to split the evaluation response into a score and a reasoning, and then convert the score into a float. The error message indicates that it's trying to convert an empty string to a float, which is not possible.

This could be happening because the apredict method in CorrectnessEvaluator.aevaluate is returning an evaluation response that doesn't contain a score. The apredict method is supposed to return a string containing the score and the reasoning separated by a newline character. If the score is missing or not a valid float, it would cause the ValueError you're seeing.

Here are a few things you can try to resolve this issue:

  1. Check the apredict method: Make sure the apredict method in your judge_llm is returning a string with the score and the reasoning separated by a newline character. The score should be a valid float.

  2. Update to the latest version: You mentioned that you're using version v0.10.17 of the llama-index package. The latest version is v0.10.18. It's possible that this issue has been fixed in the latest version. You can update to the latest version using pip:

pip install --upgrade llama-index

  3. Check the parser_function: If you're using a custom parser_function, make sure it's correctly parsing the evaluation response. The default parser function expects the score to be the first line of the response and the reasoning to be the rest of the response (see the sketch after this list).
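For reference, here is a minimal sketch of what the default parser does with a well-formed judge response (the response text below is illustrative, not actual model output):

# A well-formed judge response: the score on the first line, the reasoning after it.
eval_response = "4.5\nThe answer covers the key points and matches the reference."

score_str, reasoning_str = eval_response.split("\n", 1)
score = float(score_str)                # 4.5
reasoning = reasoning_str.lstrip("\n")  # "The answer covers the key points..."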

If none of these suggestions resolve the issue, it would be helpful if you could provide more information about your judge_llm and the apredict method. Specifically, what is the format of the string it's returning? This will help us better understand what might be causing the issue.
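If you want to inspect the raw judge output directly, you can call apredict with the same keyword arguments the evaluator uses in the traceback above (the template and Q/A values below are placeholders for whatever your setup passes in):

# Run inside an async context (e.g. a notebook cell that supports top-level await).
raw = await judge_llm.apredict(
    prompt=eval_template,  # placeholder: the correctness eval prompt template in use
    query="What does the report conclude?",  # placeholder query
    generated_answer="...",                  # placeholder generated answer
    reference_answer="(NO REFERENCE ANSWER SUPPLIED)",
)
print(repr(raw))  # check whether the string starts with "\n" or is missing a score on the first line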

For more information, you can refer to the LlamaIndex documentation and the source code for the RagEvaluatorPack.

vecorro commented 6 months ago

Hi,

The problem seems to be at line 182 of the ~/miniconda3/envs/llm-env4/lib/python3.10/site-packages/llama_index/core/evaluation/eval_utils.py script:

182 score_str, reasoning_str = eval_response.split("\n", 1)

I'm using Zephyr-7B or Llama-2-70b-chat models as judges. In both cases they produce evaluations that contain a leading \n character, for instance:

4.5
The generated answer provides a comprehensive overview ...
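
A minimal sketch of why that leading newline breaks the default parser (reproducing just the parsing step, not the library call):

# The judge output starts with "\n", so everything before the first newline is empty.
eval_response = "\n4.5\nThe generated answer provides a comprehensive overview ..."

score_str, reasoning_str = eval_response.split("\n", 1)
print(repr(score_str))    # '' -- an empty string where the score should be
score = float(score_str)  # raises ValueError: could not convert string to float: ''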

The quick fix for this is to modify line 182 by invoking the .strip() method before the .split(), like this:

score_str, reasoning_str = eval_response.strip().split("\n", 1)

Can you propose this fix to the RagEvaluatorPack maintainer?

dosubot[bot] commented 6 months ago

@logan-markewich, this one's a bit out of my wheelhouse, could you step in and provide your guidance?

vecorro commented 6 months ago

I modified the parser to make it work.
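
For anyone else hitting this, here is a sketch of that change as a standalone parser. One way to wire it in without editing eval_utils.py is CorrectnessEvaluator's parser_function argument (the hook that appears as self.parser_function in the traceback above); whether your version of RagEvaluatorPack lets you swap in a custom evaluator is something to check for your own setup:

from typing import Tuple

from llama_index.core.evaluation import CorrectnessEvaluator

def lenient_parser(eval_response: str) -> Tuple[float, str]:
    # Same logic as the default parser, but strip surrounding whitespace first
    # so a leading "\n" from the judge no longer leaves an empty score string.
    score_str, reasoning_str = eval_response.strip().split("\n", 1)
    return float(score_str), reasoning_str.lstrip("\n")

# judge_llm is the same judge LLM used earlier in this thread.
correctness_evaluator = CorrectnessEvaluator(
    llm=judge_llm,
    parser_function=lenient_parser,
)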