[Bug]: TextToCypher Retriever does not work if you pass in a `cypher_validator` without also passing in `allowed_output_fields`

bechbd commented 3 weeks ago

Bug Description

The TextToCypher retriever currently takes two separate parameters with different intents:

        cypher_validator (Optional[callable], optional):
            A callable function to validate the generated Cypher query. Defaults to None.
        allowed_query_fields (Optional[List[str]], optional):
            The fields to allow in the query output. Defaults to ["text", "label", "type"].

There is an issue in the code where it only runs the cypher_validator when the allowed_query_fields is set. This does not make sense as cypher_validator corrects the query before it is run and allowed_query_fields limits the results returned after the query is run. These should operate independently.

The checks in these two lines should be removed: https://github.com/run-llama/llama_index/blob/729d5f2e76cca29284157c6fd8b55fb9953739dd/llama-index-core/llama_index/core/indices/property_graph/sub_retrievers/text_to_cypher.py#L105 https://github.com/run-llama/llama_index/blob/729d5f2e76cca29284157c6fd8b55fb9953739dd/llama-index-core/llama_index/core/indices/property_graph/sub_retrievers/text_to_cypher.py#L139

There is not a need to add a check as it is already done in _parse_generated_cyher (also mispelled cypher)

The allowed_query_output check is still performed here https://github.com/run-llama/llama_index/blob/729d5f2e76cca29284157c6fd8b55fb9953739dd/llama-index-core/llama_index/core/indices/property_graph/sub_retrievers/text_to_cypher.py#L73

Version

0.10.65

Steps to Reproduce

Run a TextToCypher retriever but specify a cypher_validator but not allows_query_outputs

Relevant Logs/Tracbacks

No response

dosubot[bot] commented 3 weeks ago

To fix the bug in the TextToCypherRetriever so that the cypher_validator runs independently of the allowed_query_fields parameter in version 0.10.65, you need to ensure that the cypher_validator is always called regardless of whether allowed_query_fields is set or not.

Here is the corrected code snippet:

def retrieve_from_graph(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
    schema = self._graph_store.get_schema_str()
    question = query_bundle.query_str

    response = self.llm.predict(
        self.text_to_cypher_template,
        schema=schema,
        question=question,
    )
    parsed_cypher_query = self._parse_generated_cypher(response)  # Always validate the cypher query

    if self.cypher_validator:
        self.cypher_validator(parsed_cypher_query)  # Ensure cypher_validator is called

    query_output = self._graph_store.structured_query(parsed_cypher_query)

    cleaned_query_output = self._clean_query_output(query_output)

    node_text = self.response_template.format(
        query=parsed_cypher_query,
        response=str(cleaned_query_output),
    )

    return [
        NodeWithScore(
            node=TextNode(
                text=node_text,
            ),
            score=1.0,
        )
    ]

async def aretrieve_from_graph(
    self, query_bundle: QueryBundle
) -> List[NodeWithScore]:
    schema = await self._graph_store.aget_schema_str()
    question = query_bundle.query_str

    response = await self.llm.apredict(
        self.text_to_cypher_template,
        schema=schema,
        question=question,
    )

    parsed_cypher_query = self._parse_generated_cypher(response)  # Always validate the cypher query

    if self.cypher_validator:
        self.cypher_validator(parsed_cypher_query)  # Ensure cypher_validator is called

    query_output = await self._graph_store.astructured_query(parsed_cypher_query)

    cleaned_query_output = self._clean_query_output(query_output)

    node_text = self.response_template.format(
        query=parsed_cypher_query,
        response=str(cleaned_query_output),
    )

    return [
        NodeWithScore(
            node=TextNode(
                text=node_text,
            ),
            score=1.0,
        )
    ]

In this fix, the self._parse_generated_cypher(response) is called directly after generating the response from the LLM, and the cypher_validator is explicitly called if it is set, ensuring that the cypher_validator is always executed ^[1].

To continue talking to Dosu, mention @dosu.

logan-markewich commented 3 weeks ago

Yea seems like a bug, easy enough to fix :)

harshil1973 commented 1 week ago

This issue is already resolved in main branch so we can close this issue now.

run-llama / llama_index