run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
35.83k stars 5.08k forks source link

[Bug]: text-to-sql the table is incorrect when querying #16338

Open onazaruk opened 3 hours ago

onazaruk commented 3 hours ago

Bug Description

Hi,

TLDR, when I ask query "Show all topics" it creates query "SELECT name FROM topic", while the actual table name is topics, the query fails obviously

Here is my setup (I followed the guide from the documentation):

...
engine = create_engine("sqlite:///pstan.db")
sql_database = SQLDatabase(engine)
...
TABLE_MAPPING = {
    "topics": "Contains information about the topics, stores topic ID, its name and description",
    "tickets": "Contains ticket records",
    "domains": "Contins information about cloud domain"
}
table_infos = [TableInfo(k, v) for k, v in TABLE_MAPPING.items()]
table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = [
    SQLTableSchema(table_name=t.table_name, context_str=t.table_summary)
    for t in table_infos
] 
obj_index = ObjectIndex.from_objects(
    table_schema_objs,
    table_node_mapping,
    VectorStoreIndex,
)
table_retriever = obj_index.as_retriever(similarity_top_k=1)
query_engine = SQLTableRetrieverQueryEngine(
    sql_database,
    table_retriever,
    synthesize_response=False
)
...

Running queries:

query = "Show all topic names"

resp = table_retriever.retrieve(query)
display(Markdown(f"<b>{resp}</b>"))

resp = query_engine.query(query)
display(Markdown(f"<b>{resp}</b>"))

Result:

[SQLTableSchema(table_name='topics', context_str='Contains information about the topics, stores topic ID, its name and description')]

Error: Statement 'SELECT name FROM topic' is invalid SQL.

Interesting that when you query for a specific number of records it resolves table name correctly: E.g.

query = "Show 3 topic names"

Result:

[SQLTableSchema(table_name='topics', context_str='Contains information about the topics, stores topic ID, its name and description')]

[('bucket_connection_failed',), ('bucket_missing_error',), ('queue_permission_missing',)]

Version

0.10.31

Steps to Reproduce

  1. Setup sql DB with the table called "elements"
  2. Ask query to show all elements
  3. Observe

Relevant Logs/Tracbacks

No response

dosubot[bot] commented 3 hours ago

The issue you're encountering seems to be related to the SQLTableRetrieverQueryEngine's handling of table name pluralization or singularization. When you query "Show all topics," it incorrectly maps to "SELECT name FROM topic" instead of "topics," which causes the query to fail. However, when specifying a number of records, it resolves the table name correctly.

This discrepancy might be due to how the engine interprets the natural language query and maps it to the SQL table names. Unfortunately, I couldn't find specific details on how the SQLTableRetrieverQueryEngine handles such discrepancies or if there are configuration options to improve table name resolution accuracy.

One potential workaround is to ensure that your table names in the database and your queries are consistent with the expected singular or plural forms. Additionally, you might want to explore if there are any configuration options or custom logic you can implement to handle these cases more accurately.

To continue talking to Dosu, mention @dosu.

onazaruk commented 2 hours ago

@dosu the schema of the db is consistent and all the tables that store objects information have plural name. Is there way to force pluralization table name logic to work for all cases?

dosubot[bot] commented 2 hours ago

You can customize the text_to_sql_prompt in the SQLTableRetrieverQueryEngine to handle pluralization of table names. By creating a custom BasePromptTemplate that includes logic for handling pluralization, you can ensure that the engine consistently uses plural table names.

Here's an example of how you might define and use a custom prompt:

from llama_index.core.prompts import BasePromptTemplate

class CustomTextToSQLPrompt(BasePromptTemplate):
    def __init__(self):
        template = "Your custom prompt logic here, including handling pluralization."
        super().__init__(template=template)

# Initialize the SQLTableRetrieverQueryEngine with the custom prompt
custom_prompt = CustomTextToSQLPrompt()
query_engine = SQLTableRetrieverQueryEngine(
    sql_database=your_sql_database,
    table_retriever=your_table_retriever,
    text_to_sql_prompt=custom_prompt
)

This approach allows you to define specific behavior for pluralization within the CustomTextToSQLPrompt class, ensuring that table names are correctly pluralized in all cases [1].

To continue talking to Dosu, mention @dosu.

onazaruk commented 1 hour ago

@dosu, why does it resolve the table name to topic if ObjectIndex clearly has topics table only in it and retriever retrieves SQLTableSchema properly for topics table?

dosubot[bot] commented 1 hour ago

The issue likely arises from the natural language processing component of the SQLTableRetrieverQueryEngine, which may not be accurately mapping the natural language query to the correct table name. Even though the ObjectIndex and the retriever correctly identify the topics table, the engine's logic for converting the query into SQL might be defaulting to a singular form, such as topic, due to its internal handling of language patterns. This discrepancy can occur if the engine's prompt or logic doesn't explicitly account for pluralization in table names. Adjusting the prompt or logic to handle plural forms, as previously suggested, could help resolve this issue.

To continue talking to Dosu, mention @dosu.