[Question]: Langchain integration, SQLAutoVectorQueryEngine not picking tables correctly

Question Validation

[X] I have searched both the documentation and discord for an answer.

Question

Given a query_engine, query_engine.query("Tell me about the series with the highest views") gives the correct result

Querying SQL database: The first choice is more relevant as it involves translating a natural language query (like the given question) into a SQL query over tables containing movies/series and its metadata, which could potentially include information about views.
SQL query: SELECT title, views
FROM item
WHERE item_type = 'series'
ORDER BY views DESC
LIMIT 1;
SQL response: The series with the highest views is "Only for Love [THU-SUN 19.00]" with a total of 15,219,574 views.
Transformed query given SQL response: Can you provide more details about the series "Only for Love [THU-SUN 19.00]"?
query engine response: "Only for Love [THU-SUN 19.00]" is a series that was released in the year 2023. It falls under the category of series_CN. The storyline revolves around Zheng Shuyi, a finance journalist, who is given the task to work with Shi Yan, a charming CEO. Her mission in the series is to mess with his heart. The series is scheduled to air from Thursday to Sunday at 19.00.
Final response: The series with the highest views is "Only for Love [THU-SUN 19.00]" which has a total of 15,219,574 views. This series was released in the year 2023 and falls under the category of series_CN. The storyline revolves around Zheng Shuyi, a finance journalist, who is given the task to work with Shi Yan, a charming CEO. Her mission in the series is to mess with his heart. The series is scheduled to air from Thursday to Sunday at 19.00.

However, when I integrated with Langchain. Ref: https://github.com/run-llama/llama_index/blob/main/examples/langchain_demo/LangchainDemo.ipynb

tools = [
    Tool(
        name="SQLAutoVectorQueryEngine",
        func=lambda q: str(query_engine.query(q)),
        description="useful for when you want to answer queries about movies/series.",
    ),
]

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True, output_key="output")
lc_llm = ChatOpenAI(temperature=0, model="gpt-4")
agent_executor = initialize_agent(
    tools, lc_llm, agent="conversational-react-description", memory=memory, handle_parsing_errors=True, verbose=True
)

response = agent_executor.run(input="Tell me about the movie with the highest views")
response

The response didn't pick the right tables

> Entering new AgentExecutor chain...
Thought: Do I need to use a tool? Yes
Action: SQLAutoVectorQueryEngine
Action Input: {"query": "SELECT * FROM movies ORDER BY views DESC LIMIT 1"}Querying SQL database: The query provided is a SQL query related to a table containing movies, which aligns with the first choice about translating a natural language query into a SQL query over tables containing movies/series and its metadata.
SQL query: SELECT * FROM movies ORDER BY views DESC LIMIT 1
SQL response: I'm sorry, but there was an error in executing the query. The statement 'SELECT * FROM movies ORDER BY views DESC LIMIT 1' is invalid SQL.
Transformed query given SQL response: What is the most viewed movie?
query engine response: The context does not provide information on which movie is the most viewed.
Final response: I'm sorry, but the information on which movie is the most viewed is not available.

Observation: I'm sorry, but the information on which movie is the most viewed is not available.
Thought:Do I need to use a tool? No
AI: I'm sorry, but I currently don't have the specific information on the movie with the highest number of views.

> Finished chain.

🤖

Hello @cocoza4! 🙌 I'm Dosu, an AI bot designed to lend a hand with bugs, answer queries, and help you contribute to this project. Think of me as your friendly guide while we wait for a human maintainer. Let's dive into this issue, shall we?

It seems like the issue you're encountering is related to how the SQLAutoVectorQueryEngine is handling table selection when processing queries. As per the LlamaIndex documentation, the SQLAutoVectorQueryEngine does not directly handle table selection. Instead, it relies on the SQL query engine provided to it during initialization.

In your case, it appears that the SQL query engine is not correctly identifying the table to use when processing the query "Tell me about the movie with the highest views". This could be due to a variety of reasons, such as the SQL query engine not being properly configured or the underlying SQL database not containing the expected tables or data.

To troubleshoot this issue, I would recommend the following steps:

Verify that the SQL query engine is correctly configured and can successfully execute queries against your SQL database outside of the Langchain integration. This can help determine if the issue is with the SQL query engine itself or with the integration.
Check the structure and data of your SQL database to ensure it contains the expected tables and data. The SQL query engine relies on this information to correctly process queries.
Review the Langchain integration code to ensure it is correctly initializing the SQLAutoVectorQueryEngine with the SQL query engine. The SQLAutoVectorQueryEngine requires a SQL query engine that is an instance of either BaseSQLTableQueryEngine or NLSQLTableQueryEngine.

If you continue to experience issues after trying these steps, please provide more information about your SQL query engine configuration and the structure and data of your SQL database. This will help me better understand the issue and provide more targeted assistance.

Sources

llama_index/query_engine/sql_vector_query_engine.py

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@dosu-bot after investigating, I still experienced the same issue, here's my SQL query engine code,

from sqlalchemy import (
    create_engine,
)

engine = create_engine("sqlite:///trueid.db", future=True)

from llama_index.indices.struct_store.sql_query import NLSQLTableQueryEngine

tables = ["item"]

sql_database = SQLDatabase(engine, include_tables=tables)
sql_query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=tables,
)

vector_store_info = VectorStoreInfo(
    content_info="metadata about different movies/series",
    metadata_info=[
        MetadataInfo(
            name="title", type="str", description="The name of the item"
        ),
        MetadataInfo(
            name="release_year", type="str", description="Release year"
        ),
        MetadataInfo(
            name="type", type="str", description="Type of the item"
        ),
        MetadataInfo(
            name="other_type", type="str", description="Other type of the item"
        ),
        MetadataInfo(
            name="vod_main_category", type="str", description="Main category"
        ),
    ],
)
vector_auto_retriever = VectorIndexAutoRetriever(
    vector_index, vector_store_info=vector_store_info
)

retriever_query_engine = RetrieverQueryEngine.from_args(
    vector_auto_retriever, service_context=service_context
)

sql_tool = QueryEngineTool.from_defaults(
    query_engine=sql_query_engine,
    description=(
        "Useful for translating a natural language query into a SQL query over"
        " tables containing: item, containing movies/series and its metadata"
    ),
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=retriever_query_engine,
    description=(
        "Useful for answering semantic questions about different movies/series"
    ),
)
query_engine = SQLAutoVectorQueryEngine(
    sql_tool, vector_tool, service_context=service_context
)

🤖

Based on the information you've provided, it seems like the issue might be related to the selector logic in the SQLAutoVectorQueryEngine. This selector decides whether to query the SQL database or another query engine. If the selector is not choosing the SQL database when it should, it could lead to the issue you're experiencing.

Here are a few steps you can take to debug this issue:

Check the selector logic: The selector is an instance of either LLMSingleSelector or PydanticSingleSelector. It's used in the _query method of the SQLJoinQueryEngine class. The select method of the selector is called with the metadata of the SQL and other query engines, and the query bundle. If the index returned by the select method is 0, the SQL database is queried; if the index is 1, the other query engine is queried. You can add logging or print statements in the _query method to check if the selector is working correctly.
Check the SQL query: If the selector is working correctly, the issue might be with the SQL query. You can add logging or print statements in the _query_sql_other method to check if the SQL query is correctly formed and if it's returning the expected results.
Check the transformation of the query: The _query_sql_other method transforms the SQL query into a new query using the SQL response. You can add logging or print statements in this method to check if the transformation is correct.
Check the other query engine: If the SQL query and its transformation are correct, the issue might be with the other query engine. You can add logging or print statements in the _query_sql_other method to check if the other query engine is returning the expected results.
Check the synthesis of the responses: The _query_sql_other method synthesizes the responses from the SQL database and the other query engine into a final response. You can add logging or print statements in this method to check if the synthesis is correct.

You can find more information about the SQLJoinQueryEngine and its methods in the LlamaIndex repository.

Please try these debugging steps and let me know if you need further assistance.

Sources

llama_index/query_engine/sql_join_query_engine.py

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

it turns out that ChatGPT translates human language into SQL query and passes it to Langchain Tool, as shown in

Entering new AgentExecutor chain...
Thought: Do I need to use a tool? Yes
Action: SQLAutoVectorQueryEngine
Action Input: {"query": "SELECT * FROM movies ORDER BY views DESC LIMIT 1"}

The simple fix is as follows

from pydantic.v1 import BaseModel, Field

# Define the input schema
class QueryInput(BaseModel):
    q: str = Field(..., description="Only accept Human language query")

def query_tool(q: str):
    try:
        return str(query_engine.query(q))
    except Exception as ex:
        print("Exception thrown:", ex)

tools = [
    Tool(
        name="SQLAutoVectorQueryEngine",
        func=query_tool,
        description=f"useful for when you want to answer queries about movies/series. Input is the human language.", # <- specify clearly human language is expected
        args_schema=QueryInput,  # <- add schema
    ),
]

Now the input to Langchain tool would be

Entering new AgentExecutor chain...
Thought: Do I need to use a tool? Yes
Action: SQLAutoVectorQueryEngine
Action Input: Tell me about the movie with the highest views

run-llama / llama_index