run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Bug]: VectorIndexAutoRetriever fails to run with gpt-4o #14047

Open · viethoang261 opened this issue 3 months ago

viethoang261 commented 3 months ago

Bug Description

I found that gpt-4o formats its output differently when asked to produce JSON, which prevents VectorIndexAutoRetriever from running normally.

Looking at the source code of VectorIndexAutoRetriever, I found that it calls the LLM on the query_bundle to produce the JSON structure that VectorIndexAutoRetriever needs:

def generate_retrieval_spec(
        self, query_bundle: QueryBundle, **kwargs: Any
    ) -> BaseModel:
        # prepare input
        info_str = self._vector_store_info.json(indent=4)
        schema_str = VectorStoreQuerySpec.schema_json(indent=4)

        # call LLM
        output = self._llm.predict(
            self._prompt,
            schema_str=schema_str,
            info_str=info_str,
            query_str=query_bundle.query_str,
        )

        # parse output
        return self._parse_generated_spec(output, query_bundle)
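
For context, the string returned by the LLM is then parsed into a VectorStoreQuerySpec. Below is a simplified sketch of that expectation (not the library's actual implementation, which goes through an output parser); the key point is that the raw output must be a bare JSON object:

import json

from llama_index.core.vector_stores.types import VectorStoreQuerySpec

def parse_spec_sketch(output: str) -> VectorStoreQuerySpec:
    # Simplified illustration only: any text around the JSON object (for
    # example markdown fences) makes json.loads fail before validation starts.
    return VectorStoreQuerySpec.parse_obj(json.loads(output))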

With other OpenAI models (or other LLMs), the output can be parsed as JSON:

"{
    "title": "VectorStoreQuerySpec",
    "description": "Schema for a structured request for vector store\n(i.e. to be converted to a VectorStoreQuery).\n\nCurrently only used by VectorIndexAutoRetriever.",
    "type": "object",
    "properties": {
        "query": {
            "title": "Query",
            "type": "string"
        },
        "filters": {
            "title": "Filters",
            "type": "array",
            "items": {
                "$ref": "#/definitions/MetadataFilter"
            }
        },
        "top_k": {
            "title": "Top K",
            "type": "integer"
        }
    },
.....
}"

but with gpt-4o, the output is wrapped in "```json" fences, so it cannot be parsed as JSON:

"```json
{
    "title": "VectorStoreQuerySpec",
    "description": "Schema for a structured request for vector store\n(i.e. to be converted to a VectorStoreQuery).\n\nCurrently only used by VectorIndexAutoRetriever.",
    "type": "object",
    "properties": {
        "query": {
            "title": "Query",
            "type": "string"
        },
        "filters": {
            "title": "Filters",
            "type": "array",
            "items": {
                "$ref": "#/definitions/MetadataFilter"
            }
        },
        "top_k": {
            "title": "Top K",
            "type": "integer"
        }
    },
    "required": [
        "query",
        "filters"
    ],
...
}
```"

I think this affects not only VectorIndexAutoRetriever but also any llama_index component that uses the LLM to generate a JSON structure.
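
A minimal standalone demonstration of the failure mode (plain Python, no llama_index involved): json.loads rejects the fenced text, but accepts it once the markdown fences are stripped:

import json
import re

fenced = '```json\n{"query": "dafasd", "filters": [], "top_k": null}\n```'

try:
    json.loads(fenced)
except json.JSONDecodeError as exc:
    print("fenced output fails:", exc)

# Strip a leading ```json (or ```) fence and a trailing ``` fence.
stripped = re.sub(r"^```(?:json)?\s*|\s*```$", "", fenced.strip())
print(json.loads(stripped))  # {'query': 'dafasd', 'filters': [], 'top_k': None}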

Version

0.10.42

Steps to Reproduce

from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores.types import MetadataInfo, VectorStoreInfo

vector_store_info = VectorStoreInfo(
    content_info="brief biography of celebrities",
    metadata_info=[
        MetadataInfo(
            name="category",
            type="str",
            description=(
                "Category of the celebrity, one of [Sports, Entertainment,"
                " Business, Music]"
            ),
        ),
        MetadataInfo(
            name="country",
            type="str",
            description=(
                "Country of the celebrity, one of [United States, Barbados,"
                " Portugal]"
            ),
        ),
    ],
)
retriever = VectorIndexAutoRetriever(
    index, vector_store_info=vector_store_info
)
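
To actually trigger the error, set the LLM to gpt-4o (for example via Settings.llm) before building the retriever, then run any query. Here `index` is assumed to be an existing VectorStoreIndex over the celebrity documents, and the query string below is only a placeholder:

# Spec generation fails before any retrieval happens, raising the
# validation errors shown below.
nodes = retriever.retrieve("celebrities from the United States")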

Relevant Logs/Tracebacks

validation errors for VectorStoreQuerySpec
query
  field required (type=value_error.missing)
filters
  field required (type=value_error.missing)

dosubot[bot] commented 3 months ago

@viethoang261 Hello! I'm here to help you with any bugs, questions, or contributions.

To address the issue with the VectorIndexAutoRetriever failing when using the gpt-4o model due to the presence of "```json" delimiters in the output, you can modify the code to strip these delimiters before parsing the JSON. Here is how you can do it:

  1. Modify the generate_retrieval_spec method to strip the "```json" and "```" delimiters from the output string before parsing it.

  2. Use the parse_code_markdown utility function from llama_index.legacy.output_parsers.utils to handle the stripping of markdown code block delimiters.

Here is an example of how you can implement these changes:

from llama_index.legacy.output_parsers.utils import parse_code_markdown

def generate_retrieval_spec(
        self, query_bundle: QueryBundle, **kwargs: Any
    ) -> BaseModel:
    # prepare input
    info_str = self._vector_store_info.json(indent=4)
    schema_str = VectorStoreQuerySpec.schema_json(indent=4)

    # call LLM
    output = self._llm.predict(
        self._prompt,
        schema_str=schema_str,
        info_str=info_str,
        query_str=query_bundle.query_str,
    )

    # Strip the ```json and ``` delimiters
    output = parse_code_markdown(output, only_last=True)[0]

    # parse output
    return self._parse_generated_spec(output, query_bundle)

In this example, the parse_code_markdown function is used to strip the markdown code block delimiters from the output string. This ensures that the JSON can be parsed correctly, avoiding the validation errors you encountered.

This approach should resolve the issue with the VectorIndexAutoRetriever when using the gpt-4o model [1][2][3][4].

To continue talking to Dosu, mention @dosu.

logan-markewich commented 3 months ago

@viethoang261 contributions welcome to improve the output parsing. Otherwise I'll probably get to it, eventually

viethoang261 commented 3 months ago

@logan-markewich sorry, what I described above was wrong. It's not about the "```json" delimiters; the problem is that gpt-4o does not produce the expected output.

When using gpt-3.5-turbo, the output is:

"```json
{"query": "dafasd", "filters": [], "top_k": null}
```"

but when using gpt-4o, the output is just the schema_str value echoed back:

"```json
{
    "title": "VectorStoreQuerySpec",
    "description": "Schema for a structured request for vector store\n(i.e. to be converted to a VectorStoreQuery).\n\nCurrently only used by VectorIndexAutoRetriever.",
    "type": "object",
    "properties": {
        "query": {
            "title": "Query",
            "type": "string"
        },
        "filters": {
            "title": "Filters",
            "type": "array",
            "items": {
                "$ref": "#/definitions/MetadataFilter"
            }
        },
        "top_k": {
            "title": "Top K",
            "type": "integer"
        }
    },
    "required": [
        "query",
        "filters"
    ],
...
}
```"

logan-markewich commented 3 months ago

Ah yes. OK I've seen this reported a lot with gpt-4o right now, it's just really bad at structured outputs and function calling tbh

luancaarvalho commented 2 months ago

+1 I'm encountering the same error. Is there any workaround to bypass this issue?

dhirajsuvarna commented 1 month ago

+1, facing same issue