run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: ValueError: Failed to convert output to JSON: '```json\n[]\n```' #15768

Closed dimartinez closed 2 months ago

dimartinez commented 2 months ago

Bug Description

When using 'gpt-4o-mini', asking a specific question (one that should be answered by the VectorStoreIndex) through RouterQueryEngine raises an error about converting the output to JSON. The error is not raised with the gpt-4o model.

Version

0.11.3

Steps to Reproduce

from http.client import responses

from llama_index.core import (
    SimpleDirectoryReader,
    Settings,
    SummaryIndex,
    VectorStoreIndex,
)
from llama_index.core.tools import QueryEngineTool
from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors.llm_selectors import LLMSingleSelector
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter

Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=['a pdf file... fails with any document...']
).load_data()

splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)

summary_index = SummaryIndex(nodes)
vector_index = VectorStoreIndex(nodes)

summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True
)

vector_query_engine = vector_index.as_query_engine()

summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description=(
        "Useful for respond to summarizations questions about bla bla"
    )
)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description=(
        "Useful for respond to specific questions about bla bla"
    )
)

query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        summary_tool,
        vector_tool
    ],
    verbose=True
)

response = query_engine.query("What's a tech radar?")
print(str(response))

Relevant Logs/Tracebacks

Traceback (most recent call last):
  File "C:\Users\tabar\PycharmProjects\tech-radar\tech-radar.py", line 69, in <module>
    response = query_engine.query("What's a tech radar?")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tabar\anaconda3\envs\tech-radar\Lib\site-packages\llama_index\core\instrumentation\dispatcher.py", line 261, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tabar\anaconda3\envs\tech-radar\Lib\site-packages\llama_index\core\base\base_query_engine.py", line 52, in query
    query_result = self._query(str_or_query_bundle)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tabar\anaconda3\envs\tech-radar\Lib\site-packages\llama_index\core\instrumentation\dispatcher.py", line 261, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tabar\anaconda3\envs\tech-radar\Lib\site-packages\llama_index\core\query_engine\router_query_engine.py", line 154, in _query
    result = self._selector.select(self._metadatas, query_bundle)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tabar\anaconda3\envs\tech-radar\Lib\site-packages\llama_index\core\base\base_selector.py", line 88, in select
    return self._select(choices=metadatas, query=query_bundle)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tabar\anaconda3\envs\tech-radar\Lib\site-packages\llama_index\core\instrumentation\dispatcher.py", line 261, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tabar\anaconda3\envs\tech-radar\Lib\site-packages\llama_index\core\selectors\llm_selectors.py", line 114, in _select
    parse = self._prompt.output_parser.parse(prediction)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tabar\anaconda3\envs\tech-radar\Lib\site-packages\llama_index\core\instrumentation\dispatcher.py", line 261, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tabar\anaconda3\envs\tech-radar\Lib\site-packages\llama_index\core\output_parsers\selection.py", line 97, in parse
    raise ValueError(f"Failed to convert output to JSON: {output!r}")
ValueError: Failed to convert output to JSON: '```json\n[]\n```'

Process finished with exit code 1

dosubot[bot] commented 2 months ago

@dimartinez Hello! I'm here to help you with any issues you're experiencing. If you need assistance with bug fixes, questions, or becoming a contributor, feel free to ask.

I found a similar unsolved discussion regarding a JSON conversion error in the RouterQueryEngine when using the gpt-4o-mini model. You can check it out here [1].

The error you're encountering seems to be related to the output_parser in the LLMSingleSelector. The output_parser is responsible for converting the output into a structured JSON format. If the output is not in the expected format, it raises a ValueError.

Here's a brief overview of how the output_parser handles JSON conversion:

  1. Initial JSON Parsing: It first attempts to marshal the output to JSON using the _marshal_llm_to_json function.
  2. Fallback Parsing: If the initial parsing fails, it uses pyyaml as a fallback to handle less strict JSON formats.
  3. Filtering and Formatting: The parsed JSON is then filtered and formatted to ensure it contains the required keys before being converted into Answer objects.

Here is the relevant code snippet:

def parse(self, output: str) -> Any:
    json_string = _marshal_llm_to_json(output)
    try:
        json_obj = json.loads(json_string)
    except json.JSONDecodeError as e_json:
        try:
            import yaml
            json_obj = yaml.safe_load(json_string)
        except yaml.YAMLError as e_yaml:
            raise OutputParserException(
                f"Got invalid JSON object. Error: {e_json} {e_yaml}. "
                f"Got JSON string: {json_string}"
            )
        except NameError as exc:
            raise ImportError("Please pip install PyYAML.") from exc
    if isinstance(json_obj, dict):
        json_obj = [json_obj]

    if not json_obj:
        raise ValueError(f"Failed to convert output to JSON: {output!r}")

    json_output = self._format_output(json_obj)
    answers = [Answer.from_dict(json_dict) for json_dict in json_output]
    return StructuredOutput(raw_output=output, parsed_output=answers)

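The likely cause is that the gpt-4o-mini model returns an empty list ([]) as its selection. The output_parser treats an empty result as a failure (the "if not json_obj" check in the snippet) and raises the ValueError instead of returning a selection. You might want to add additional checks or handle empty outputs explicitly in your code.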
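The empty-list failure can be reproduced without calling the model. The sketch below is a simplified, standalone approximation of the parsing steps in the snippet above, not the library's code: extract_json_payload and parse_selection are hypothetical names, the bracket-trimming logic is an assumption about what the marshalling step does, and the YAML fallback is omitted.

```python
import json


def extract_json_payload(output: str) -> str:
    """Hypothetical stand-in for the marshalling step: keep only the text
    from the first '[' or '{' to the last matching closing bracket."""
    output = output.strip()
    left_square = output.find("[")
    left_brace = output.find("{")
    if left_brace == -1 or (left_square != -1 and left_square < left_brace):
        left, right = left_square, output.rfind("]")
    else:
        left, right = left_brace, output.rfind("}")
    return output[left:right + 1]


def parse_selection(output: str) -> list:
    """Simplified version of the parse flow (no YAML fallback)."""
    json_obj = json.loads(extract_json_payload(output))
    if isinstance(json_obj, dict):
        json_obj = [json_obj]
    # An empty list is valid JSON but falsy, so it is rejected here.
    if not json_obj:
        raise ValueError(f"Failed to convert output to JSON: {output!r}")
    return json_obj


if __name__ == "__main__":
    try:
        parse_selection("```json\n[]\n```")
    except ValueError as exc:
        print(exc)
```

In this sketch the fenced '[]' parses cleanly as JSON, so the error comes from the emptiness check rather than from the Markdown fence around the output.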

For further investigation, you can refer to the output_parser implementation in the selection.py file [2][3].

dimartinez commented 2 months ago

@dosubot it works fine if I use PydanticSingleSelector.