run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Error with CorrectnessEvaluator Using HuggingFace API in RAG Evaluation Metrics #9638

Closed MaxenceQueyrel closed 8 months ago

MaxenceQueyrel commented 8 months ago

Bug Description

Hello,

I am encountering an issue while attempting to run the RAG evaluation metrics, specifically the CorrectnessEvaluator.

Could you please provide any insights or guidance on why this error might be occurring with the HuggingFace API, and how it might be resolved?

Thanks.

Version

0.9.17

Steps to Reproduce

I reproduced this notebook, but without using the OpenAI LLM: https://github.com/run-llama/llama_index/blob/main/docs/examples/llama_dataset/downloading_llama_datasets.ipynb

I tried to use mistralai/Mistral-7B-v0.1 via the HuggingFace Inference API, and it didn't work:

import os
from llama_index.llms import HuggingFaceInferenceAPI

HF_TOKEN = os.getenv("HF_API_KEY")
model_name = "mistralai/Mistral-7B-v0.1"

llm = HuggingFaceInferenceAPI(
    model_name=model_name, token=HF_TOKEN
)

I also tried text-bison@001 from VertexAI, and this time it worked as expected:

from langchain.llms import VertexAI
from llama_index.llms import LangChainLLM

bison = VertexAI(
    model_name="text-bison@001",
    max_output_tokens=300,
    temperature=0.1,
    top_p=0.8,
    top_k=40,
    verbose=False,
)

llm_vertex = LangChainLLM(llm=bison)

Relevant Logs/Tracebacks

ValueError                                Traceback (most recent call last)
Cell In[14], line 14
      5 evals = {
      6     "correctness": [],
      7     # "relevancy": [],
      8     # "faithfulness": [],
      9     # "context_similarity": [],
     10 }
     11 for example, prediction in tqdm.tqdm(
     12     zip(rag_dataset.examples, prediction_dataset.predictions)
     13 ):
---> 14     correctness_result = judges["correctness"].evaluate(
     15         query=example.query,
     16         response=prediction.response,
     17         reference=example.reference_answer,
     18     )
     20     # relevancy_result = judges["relevancy"].evaluate(
     21     #     query=example.query,
     22     #     response=prediction.response,
   (...)
     35     #     reference="\n".join(example.reference_contexts),
     36     # )
     38     evals["correctness"].append(correctness_result)

File ~/miniconda/envs/genai/lib/python3.11/site-packages/llama_index/evaluation/base.py:56, in BaseEvaluator.evaluate(self, query, response, contexts, **kwargs)
     43 def evaluate(
     44     self,
     45     query: Optional[str] = None,
   (...)
     48     **kwargs: Any,
     49 ) -> EvaluationResult:
     50     """Run evaluation with query string, retrieved contexts,
     51     and generated response string.
     52 
     53     Subclasses can override this method to provide custom evaluation logic and
     54     take in additional arguments.
     55     """
---> 56     return asyncio.run(
     57         self.aevaluate(
     58             query=query,
     59             response=response,
     60             contexts=contexts,
     61             **kwargs,
     62         )
     63     )

File ~/miniconda/envs/genai/lib/python3.11/site-packages/nest_asyncio.py:31, in _patch_asyncio.<locals>.run(main, debug)
     29 task = asyncio.ensure_future(main)
     30 try:
---> 31     return loop.run_until_complete(task)
     32 finally:
     33     if not task.done():

File ~/miniconda/envs/genai/lib/python3.11/site-packages/nest_asyncio.py:99, in _patch_loop.<locals>.run_until_complete(self, future)
     96 if not f.done():
     97     raise RuntimeError(
     98         'Event loop stopped before Future completed.')
---> 99 return f.result()

File ~/miniconda/envs/genai/lib/python3.11/asyncio/futures.py:203, in Future.result(self)
    201 self.__log_traceback = False
    202 if self._exception is not None:
--> 203     raise self._exception.with_traceback(self._exception_tb)
    204 return self._result

File ~/miniconda/envs/genai/lib/python3.11/asyncio/tasks.py:267, in Task.__step(***failed resolving arguments***)
    263 try:
    264     if exc is None:
    265         # We use the `send` method directly, because coroutines
    266         # don't have `__iter__` and `__next__` methods.
--> 267         result = coro.send(None)
    268     else:
    269         result = coro.throw(exc)

File ~/miniconda/envs/genai/lib/python3.11/site-packages/llama_index/evaluation/correctness.py:140, in CorrectnessEvaluator.aevaluate(***failed resolving arguments***)
    132 eval_response = await self._service_context.llm.apredict(
    133     prompt=self._eval_template,
    134     query=query,
    135     generated_answer=response,
    136     reference_answer=reference,
    137 )
    139 # Use the parser function
--> 140 score, reasoning = self.parser_function(eval_response)
    142 return EvaluationResult(
    143     query=query,
    144     response=response,
   (...)
    147     feedback=reasoning,
    148 )

File ~/miniconda/envs/genai/lib/python3.11/site-packages/llama_index/evaluation/eval_utils.py:76, in default_parser(eval_response)
     66 """
     67 Default parser function for evaluation response.
     68 
   (...)
     73     Tuple[float, str]: A tuple containing the score as a float and the reasoning as a string.
     74 """
     75 score_str, reasoning_str = eval_response.split("\n", 1)
---> 76 score = float(score_str)
     77 reasoning = reasoning_str.lstrip("\n")
     78 return score, reasoning

ValueError: could not convert string to float: ''
dosubot[bot] commented 8 months ago

🤖

Hello @MaxenceQueyrel,

Nice to meet you! I'm Dosu, a bot here to assist you while we wait for a human maintainer. I'm designed to help solve bugs, answer questions, and guide you in contributing to our repository. I'm here to make your experience smoother. Let's get to work on your issue!

Based on the error message you provided, it seems that the issue arises when the default_parser function tries to convert an empty string to a float. This function is used to parse the evaluation response from the language model, which should be a string containing a numerical score followed by a reasoning string. The error indicates that the language model's response does not contain a numerical score, hence the conversion error.

The mistralai/Mistral-7B-v0.1 model from HuggingFace might not be returning the expected output format. The expected output should be a string where the first line can be parsed into a float (representing the score), and the rest of the string is the reasoning. If the model does not return a score in the first line of its output, the default_parser function will fail with a ValueError.

Here's the relevant code from the default_parser function in llama_index/evaluation/eval_utils.py:

def default_parser(eval_response: str) -> Tuple[float, str]:
    score_str, reasoning_str = eval_response.split("\n", 1)
    score = float(score_str)
    reasoning = reasoning_str.lstrip("\n")
    return score, reasoning
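
As you can see, a response whose first line holds the numeric score parses cleanly, while a response whose first line is empty reproduces exactly the error in your traceback. A small illustration (the strings below are hypothetical):

good_response = "4.5\nThe generated answer covers the key points of the reference answer."
score_str, reasoning_str = good_response.split("\n", 1)
score = float(score_str)  # 4.5

bad_response = "\nThe answer is mostly correct..."
score_str, reasoning_str = bad_response.split("\n", 1)
float(score_str)  # ValueError: could not convert string to float: ''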

As a workaround, you could try to implement a custom parser function that can handle the output format of the mistralai/Mistral-7B-v0.1 model. You can pass this function to the CorrectnessEvaluator when you instantiate it:

from typing import Tuple
from llama_index.evaluation import CorrectnessEvaluator

def custom_parser(eval_response: str) -> Tuple[float, str]:
    # Implement your custom parsing logic here
    ...

correctness_evaluator = CorrectnessEvaluator(parser_function=custom_parser)

However, without knowing the exact output format of the mistralai/Mistral-7B-v0.1 model, it's hard to provide a specific implementation for the custom_parser function. You might need to inspect the output of the model to understand how to parse it.
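
If you do go this route, one rough, untested sketch is to pull the first number that appears anywhere in the response and treat the full text as the reasoning. The regex heuristic and the 0.0 fallback are assumptions about the model's output, not the library's behavior:

import re
from typing import Tuple

def custom_parser(eval_response: str) -> Tuple[float, str]:
    # Assumption: the score is the first number appearing anywhere in the response
    match = re.search(r"\d+(?:\.\d+)?", eval_response)
    if match is None:
        # No score found; fall back to 0.0 so the evaluation loop doesn't crash
        return 0.0, eval_response.strip()
    return float(match.group()), eval_response.strip()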

As for why the text-bison@001 model from VertexAI works as expected, it's likely that this model returns an output format that is compatible with the default_parser function.

I hope this helps! If you have any further questions or need more clarification, feel free to ask.

logan-markewich commented 8 months ago

This is likely because mistralai/Mistral-7B-v0.1 is not predicting an answer that we can parse (i.e., it's not following instructions).

mistralai/Mistral-7B-v0.1 is not fine-tuned; it is a raw completion model. Likely, you'll want to use something like the instruct fine-tuned version (which requires some extra formatting settings):

model_name = "mistralai/Mistral-7B-Instruct-v0.1"

llm = HuggingFaceInferenceAPI(
    model_name=model_name, 
    token=HF_TOKEN,
    query_wrapper_prompt="<s>[INST] {query_str} [/INST] </s>\n"
)
MaxenceQueyrel commented 8 months ago

Thanks for this explanation! Things are moving in a better direction, but I still have the problem. Following your advice, I load the LLM like this:

model_name = "mistralai/Mistral-7B-Instruct-v0.1"

llm = HuggingFaceInferenceAPI(
    model_name=model_name, 
    token=HF_TOKEN,
    query_wrapper_prompt=PromptTemplate("<s>[INST] {query_str} [/INST] </s>\n")
)

When I re-run my code below, it seems to work for a few samples, but then it raises the same error.

correctness_result = judges["correctness"].evaluate(
    query=example.query,
    response=prediction.response,
    reference=example.reference_answer,
)

A working example has the following parameter values:

query = 'In the essay, the author mentions his early experiences with programming. Describe the first computer he used for programming, the language he used, and the challenges he faced.'
response = "\nThe first computer the author used for programming was an IBM 1401, which he used in 9th grade. The language he used was an early version of Fortran. The challenges he faced with the IBM 1401 were that he couldn't figure out what to do with it and there wasn't much he could do with it, as the only form of input to programs was data stored on punched cards and he didn't have any data stored on punched cards. He also mentions that he was puzzled by the 1401 and that in retrospect there's not much he could have done with it."
reference = "The first computer the author used for programming was the IBM 1401, which was used by his school district for data processing. He started using it in 9th grade, around the age of 13 or 14. The programming language he used was an early version of Fortran. The author faced several challenges while using this computer. The only form of input to programs was data stored on punched cards, and he didn't have any data stored on punched cards. The only other option was to do things that didn't rely on any input, like calculate approximations of pi, but he didn't know enough math to do anything interesting of that type. Therefore, he couldn't figure out what to do with it and in retrospect, he believes there's not much he could have done with it."

A non-working example, which returns the error ValueError: could not convert string to float: '', has these parameter values:

query = 'In the essay, the author mentions his interest in both computer science and art. Discuss how he attempts to reconcile these two interests during his time in grad school. Provide specific examples from the text.'
response = "\nThe author's interest in both computer science and art creates a conflict during his time in grad school. He is working on his PhD in computer science, but he also wants to be an artist and enjoys Lisp hacking. He feels like he is working on multiple projects that are not his thesis.\n\nTo reconcile these two interests, the author decides to take a shot at writing a dissertation in the remaining 5 weeks before the deadline. He chooses the topic of applications of continuations, which he believes will be easier to write about than macros and embedded languages. He is able to write his dissertation quickly, using parts of On Lisp where he can, and he manages to graduate that June.\n\nAfter graduating, the author applies to art schools. He is accepted into the BFA program at RISD, which means he has to go to college again. He spends the summer foundation classes, which he enjoys, but he is also preparing for the entrance exam to the Accademia di Belli Arti in Florence.\n\nThe author is accepted into the Accademia, but he discovers that he has to learn Italian to take the entrance exam. He manages to pass the"
reference = 'The author in the essay is Paul Graham, who was pursuing a PhD in computer science but also developed an interest in art. He found himself in a complex situation where he was simultaneously exploring different fields and trying to reconcile his interests.\n\nIn terms of computer science, Graham was particularly interested in Lisp, a programming language. He decided to write a book about Lisp hacking, titled "On Lisp," as a way to deepen his understanding of the language. He was also intrigued by the idea of building things that would last, which he felt was a challenge in the field of systems work in computer science as any program written would eventually become obsolete.\n\nOn the other hand, Graham also developed a fascination with art. He was inspired by a visit to the Carnegie Institute where he realized that paintings, unlike software, do not become obsolete and can last for hundreds of years. This realization led him to consider the possibility of becoming an artist himself. Despite having no prior experience, he started taking art classes at Harvard while still being in the PhD program.\n\nGraham found himself working on multiple projects that were not related to his thesis. He was writing his book on Lisp, taking art classes, and still trying to complete his PhD. He was in a dilemma about how to reconcile his varied interests and find a way out of grad school.\n\nA turning point came when his professor asked if he was ready to graduate. Despite not having written a word of his dissertation, Graham decided to write one in the remaining weeks before the deadline, reusing parts of his book "On Lisp" where he could. This decision reflects his attempt to reconcile his interests by integrating his work on Lisp into his academic obligations.'

I am struggling to figure out why this is happening; do you have any ideas, please?

logan-markewich commented 8 months ago

And thus, we've arrived at the joy of open source LLMs. Once again, it's producing some output that llama-index can't parse.

This tends to happen with open-source LLMs; they really aren't quite reliable for this task.

Better to try/except these cases and either retry or skip them.
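
For example, something along these lines, reusing the loop from your snippet above (a sketch only, adjust as needed):

for example, prediction in tqdm.tqdm(
    zip(rag_dataset.examples, prediction_dataset.predictions)
):
    try:
        correctness_result = judges["correctness"].evaluate(
            query=example.query,
            response=prediction.response,
            reference=example.reference_answer,
        )
    except ValueError:
        # The judge's output couldn't be parsed; skip this sample (or retry instead)
        continue
    evals["correctness"].append(correctness_result)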

MaxenceQueyrel commented 8 months ago

Okay, so it is more related to the open-source LLMs. Thanks for your help!