[Bug]: Example for JSONalyze Query Engine fails with OperationalError: near "sql": syntax error

tpanza commented 1 week ago

Bug Description

Tried following this example for the JSONalyze Query Engine (except for the LLM being Azure OpenAI), and OperationalError: near "sql": syntax error is raised.

Version

0.10.38

Steps to Reproduce

Just run the code from the docs: https://docs.llamaindex.ai/en/latest/examples/query_engine/JSONalyze_query_engine/

My setup required setting the LLM to Azure OpenAI.

#%%

# https://docs.llamaindex.ai/en/stable/examples/query_engine/JSONalyze_query_engine/
#%%
import logging
import os
import sys

from IPython.display import Markdown, display
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.core.query_engine import JSONalyzeQueryEngine

#%%
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
#%%
json_list = [
    {
        "name": "John Doe",
        "age": 25,
        "major": "Computer Science",
        "email": "john.doe@example.com",
        "address": "123 Main St",
        "city": "New York",
        "state": "NY",
        "country": "USA",
        "phone": "+1 123-456-7890",
        "occupation": "Software Engineer",
    },
    {
        "name": "Jane Smith",
        "age": 30,
        "major": "Business Administration",
        "email": "jane.smith@example.com",
        "address": "456 Elm St",
        "city": "San Francisco",
        "state": "CA",
        "country": "USA",
        "phone": "+1 234-567-8901",
        "occupation": "Marketing Manager",
    },
    {
        "name": "Michael Johnson",
        "age": 35,
        "major": "Finance",
        "email": "michael.johnson@example.com",
        "address": "789 Oak Ave",
        "city": "Chicago",
        "state": "IL",
        "country": "USA",
        "phone": "+1 345-678-9012",
        "occupation": "Financial Analyst",
    },
    {
        "name": "Emily Davis",
        "age": 28,
        "major": "Psychology",
        "email": "emily.davis@example.com",
        "address": "234 Pine St",
        "city": "Los Angeles",
        "state": "CA",
        "country": "USA",
        "phone": "+1 456-789-0123",
        "occupation": "Psychologist",
    },
    {
        "name": "Alex Johnson",
        "age": 27,
        "major": "Engineering",
        "email": "alex.johnson@example.com",
        "address": "567 Cedar Ln",
        "city": "Seattle",
        "state": "WA",
        "country": "USA",
        "phone": "+1 567-890-1234",
        "occupation": "Civil Engineer",
    },
    {
        "name": "Jessica Williams",
        "age": 32,
        "major": "Biology",
        "email": "jessica.williams@example.com",
        "address": "890 Walnut Ave",
        "city": "Boston",
        "state": "MA",
        "country": "USA",
        "phone": "+1 678-901-2345",
        "occupation": "Biologist",
    },
    {
        "name": "Matthew Brown",
        "age": 26,
        "major": "English Literature",
        "email": "matthew.brown@example.com",
        "address": "123 Peach St",
        "city": "Atlanta",
        "state": "GA",
        "country": "USA",
        "phone": "+1 789-012-3456",
        "occupation": "Writer",
    },
    {
        "name": "Olivia Wilson",
        "age": 29,
        "major": "Art",
        "email": "olivia.wilson@example.com",
        "address": "456 Plum Ave",
        "city": "Miami",
        "state": "FL",
        "country": "USA",
        "phone": "+1 890-123-4567",
        "occupation": "Artist",
    },
    {
        "name": "Daniel Thompson",
        "age": 31,
        "major": "Physics",
        "email": "daniel.thompson@example.com",
        "address": "789 Apple St",
        "city": "Denver",
        "state": "CO",
        "country": "USA",
        "phone": "+1 901-234-5678",
        "occupation": "Physicist",
    },
    {
        "name": "Sophia Clark",
        "age": 27,
        "major": "Sociology",
        "email": "sophia.clark@example.com",
        "address": "234 Orange Ln",
        "city": "Austin",
        "state": "TX",
        "country": "USA",
        "phone": "+1 012-345-6789",
        "occupation": "Social Worker",
    },
    {
        "name": "Christopher Lee",
        "age": 33,
        "major": "Chemistry",
        "email": "christopher.lee@example.com",
        "address": "567 Mango St",
        "city": "San Diego",
        "state": "CA",
        "country": "USA",
        "phone": "+1 123-456-7890",
        "occupation": "Chemist",
    },
    {
        "name": "Ava Green",
        "age": 28,
        "major": "History",
        "email": "ava.green@example.com",
        "address": "890 Cherry Ave",
        "city": "Philadelphia",
        "state": "PA",
        "country": "USA",
        "phone": "+1 234-567-8901",
        "occupation": "Historian",
    },
    {
        "name": "Ethan Anderson",
        "age": 30,
        "major": "Business",
        "email": "ethan.anderson@example.com",
        "address": "123 Lemon Ln",
        "city": "Houston",
        "state": "TX",
        "country": "USA",
        "phone": "+1 345-678-9012",
        "occupation": "Entrepreneur",
    },
    {
        "name": "Isabella Carter",
        "age": 28,
        "major": "Mathematics",
        "email": "isabella.carter@example.com",
        "address": "456 Grape St",
        "city": "Phoenix",
        "state": "AZ",
        "country": "USA",
        "phone": "+1 456-789-0123",
        "occupation": "Mathematician",
    },
    {
        "name": "Andrew Walker",
        "age": 32,
        "major": "Economics",
        "email": "andrew.walker@example.com",
        "address": "789 Berry Ave",
        "city": "Portland",
        "state": "OR",
        "country": "USA",
        "phone": "+1 567-890-1234",
        "occupation": "Economist",
    },
    {
        "name": "Mia Evans",
        "age": 29,
        "major": "Political Science",
        "email": "mia.evans@example.com",
        "address": "234 Lime St",
        "city": "Washington",
        "state": "DC",
        "country": "USA",
        "phone": "+1 678-901-2345",
        "occupation": "Political Analyst",
    },
]
#%%
llm = AzureOpenAI(
    engine="...",
    model="gpt-4",
    temperature=0.0,
    azure_deployment="...",
    azure_endpoint="...",
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-03-01-preview",
)
#%%
json_stats_query_engine = JSONalyzeQueryEngine(
    list_of_dict=json_list,
    llm=llm,
    verbose=True,
)
#%%
questions = [
    "What is the average age of the individuals in the dataset?",
    "What is the maximum age among the individuals?",
    "What is the minimum age among the individuals?",
    "How many individuals have a major in Psychology?",
    "What is the most common major among the individuals?",
    "What is the percentage of individuals residing in California (CA)?",
    "How many individuals have an occupation related to science or engineering?",
    "What is the average length of the email addresses in the dataset?",
    "How many individuals have a phone number starting with '+1 234'?",
    "What is the distribution of ages among the individuals?",
]
#%%
display(
    Markdown("> Question: {}".format(questions[0])),
    Markdown("Answer: {}".format(json_stats_query_engine.query(questions[0]))),
)
#%%
display(
    Markdown("> Question: {}".format(questions[4])),
    Markdown("Answer: {}".format(json_stats_query_engine.query(questions[4]))),
)
#%%
display(
    Markdown("> Question: {}".format(questions[7])),
    Markdown("Answer: {}".format(json_stats_query_engine.query(questions[7]))),
)
#%%
display(
    Markdown("> Question: {}".format(questions[5])),
    Markdown("Answer: {}".format(json_stats_query_engine.query(questions[5]))),
)
#%%
display(
    Markdown("> Question: {}".format(questions[9])),
    Markdown("Answer: {}".format(json_stats_query_engine.query(questions[9]))),
)
#%%
# e2e test async

json_stats_aquery_engine = JSONalyzeQueryEngine(
    list_of_dict=json_list,
    llm=llm,
    verbose=True,
    use_async=True,
)
#%%
display(
    Markdown("> Question: {}".format(questions[7])),
    Markdown("Answer: {}".format(json_stats_query_engine.query(questions[7]))),
)
#%%

llama-index==0.10.38
llama-index-agent-openai==0.2.5
llama-index-cli==0.1.12
llama-index-core==0.10.38.post2
llama-index-embeddings-azure-openai==0.1.9
llama-index-embeddings-huggingface==0.2.0
llama-index-embeddings-instructor==0.1.3
llama-index-embeddings-openai==0.1.10
llama-index-extractors-entity==0.1.2
llama-index-indices-managed-llama-cloud==0.1.6
llama-index-legacy==0.9.48
llama-index-llms-azure-openai==0.1.8
llama-index-llms-openai==0.1.20
llama-index-multi-modal-llms-openai==0.1.6
llama-index-postprocessor-rankllm-rerank==0.1.3
llama-index-program-openai==0.1.6
llama-index-question-gen-openai==0.1.3
llama-index-readers-file==0.1.23
llama-index-readers-llama-parse==0.1.4
llama-index-vector-stores-faiss==0.1.2
llama-parse==0.4.3
llamaindex-py-client==0.1.19

Python version 3.11.9.

Relevant Logs/Tracbacks

Query: What is the average age of the individuals in the dataset?
Error executing query: sql
SELECT AVG(age) AS average_age FROM items;
---------------------------------------------------------------------------
OperationalError                          Traceback (most recent call last)
File /app/local/myusername/conda_envs/py311agentsds/lib/python3.11/site-packages/llama_index/core/query_engine/jsonalyze_query_engine.py:107, in default_jsonalyzer(list_of_dict, query_bundle, llm, table_name, prompt, sql_parser)
    105 try:
    106     # Execute the SQL query
--> 107     results = list(db.query(sql_query))
    108 except sqlite_utils.utils.sqlite3.OperationalError as exc:

File /app/local/myusername/conda_envs/py311agentsds/lib/python3.11/site-packages/sqlite_utils/db.py:503, in Database.query(self, sql, params)
    496 """
    497 Execute ``sql`` and return an iterable of dictionaries representing each row.
    498 
   (...)
    501   parameters, or a dictionary for ``where id = :id``
    502 """
--> 503 cursor = self.execute(sql, params or tuple())
    504 keys = [d[0] for d in cursor.description]

File /app/local/myusername/conda_envs/py311agentsds/lib/python3.11/site-packages/sqlite_utils/db.py:521, in Database.execute(self, sql, parameters)
    520 if parameters is not None:
--> 521     return self.conn.execute(sql, parameters)
    522 else:

OperationalError: near "sql": syntax error

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
/app/experiments/tony/agent-sds-regulations/src/hello_world/jsonalyze_llamaindex_toy.py in line 4
      240 #%%
      241 display(
      242     Markdown("> Question: {}".format(questions[0])),
----> 243     Markdown("Answer: {}".format(json_stats_query_engine.query(questions[0]))),
      244 )

File /app/local/myusername/conda_envs/py311agentsds/lib/python3.11/site-packages/llama_index/core/instrumentation/dispatcher.py:274, in Dispatcher.span..wrapper(func, instance, args, kwargs)
    270 self.span_enter(
    271     id_=id_, bound_args=bound_args, instance=instance, parent_id=parent_id
    272 )
    273 try:
--> 274     result = func(*args, **kwargs)
    275 except BaseException as e:
    276     self.event(SpanDropEvent(span_id=id_, err_str=str(e)))

File /app/local/myusername/conda_envs/py311agentsds/lib/python3.11/site-packages/llama_index/core/base/base_query_engine.py:53, in BaseQueryEngine.query(self, str_or_query_bundle)
     51     if isinstance(str_or_query_bundle, str):
     52         str_or_query_bundle = QueryBundle(str_or_query_bundle)
---> 53     query_result = self._query(str_or_query_bundle)
     54 dispatch_event(QueryEndEvent(query=str_or_query_bundle, response=query_result))
     55 return query_result

File /app/local/myusername/conda_envs/py311agentsds/lib/python3.11/site-packages/llama_index/core/query_engine/jsonalyze_query_engine.py:288, in JSONalyzeQueryEngine._query(self, query_bundle)
    285     print_text(f"Query: {query}\n", color="green")
    287 # Perform the analysis
--> 288 sql_query, table_schema, results = self._analyzer(
    289     self._list_of_dict,
    290     query_bundle,
    291     self._llm,
    292     table_name=self._table_name,
    293     prompt=self._jsonalyze_prompt,
    294     sql_parser=self._sql_parser,
    295 )
    296 if self._verbose:
    297     print_text(f"SQL Query: {sql_query}\n", color="blue")

File /app/local/myusername/conda_envs/py311agentsds/lib/python3.11/site-packages/llama_index/core/query_engine/jsonalyze_query_engine.py:110, in default_jsonalyzer(list_of_dict, query_bundle, llm, table_name, prompt, sql_parser)
    108 except sqlite_utils.utils.sqlite3.OperationalError as exc:
    109     print_text(f"Error executing query: {sql_query}")
--> 110     raise ValueError("Invalid query") from exc
    112 return sql_query, table_schema, results

ValueError: Invalid query

dosubot[bot] commented 1 week ago

Hey there, @tpanza! I'm here to help you with any issues or questions you have while we wait for a human maintainer. Let's squash those bugs together! 🤖

To resolve the OperationalError: near "sql": syntax error when using Azure OpenAI with the JSONalyze Query Engine from the Llama Index library, you need to ensure that the SQL query generated by the LLM is valid. The error typically indicates that the SQL syntax is incorrect.

Here are the steps to troubleshoot and resolve the issue:

Check the Generated SQL Query: Print the SQL query generated by the LLM to verify its correctness.
Validate the SQL Syntax: Ensure that the SQL query adheres to the correct syntax for SQLite.
Adjust the Prompt: Modify the prompt used to generate the SQL query to provide better context or constraints to the LLM.

Here is an example of how you can print the SQL query for debugging purposes:

try:
    # Execute the SQL query
    results = list(db.query(sql_query))
except sqlite_utils.db_exceptions.OperationalError as exc:
    print_text(f"Error executing query: {sql_query}")
    raise ValueError(f"Invalid query: {sql_query}") from exc

Additionally, you can adjust the prompt to provide more specific instructions to the LLM:

prompt = prompt or DEFAULT_JSONALYZE_PROMPT
response_str = await service_context.llm.apredict(
    prompt=prompt,
    table_name=table_name,
    table_schema=table_schema,
    question=query,
)

# Print the response string for debugging
print_text(f"Generated SQL Query: {response_str}")

sql_query = sql_parser.parse_response_to_sql(response_str, query_bundle)

By printing the generated SQL query, you can identify any syntax errors and adjust the prompt or the SQL parser accordingly.

The JSONalyzeQueryEngine class in the Llama Index library provides methods to handle both synchronous and asynchronous query execution. Ensure you are using the correct method (_query for synchronous or _aquery for asynchronous) based on your implementation. Here is a snippet from the JSONalyzeQueryEngine class that shows how the SQL query is generated and executed:

def _query(self, query_bundle: QueryBundle) -> Response:
    query = query_bundle.query_str
    if self._verbose:
        print_text(f"Query: {query}\n", color="green")

    # Perform the analysis
    sql_query, table_schema, results = self._analyzer(
        self._list_of_dict,
        query_bundle,
        self._service_context,
        table_name=self._table_name,
        prompt=self._jsonalyze_prompt,
        sql_parser=self._sql_parser,
    )
    if self._verbose:
        print_text(f"SQL Query: {sql_query}\n", color="blue")
        print_text(f"Table Schema: {table_schema}\n", color="cyan")
        print_text(f"SQL Response: {results}\n", color="yellow")

    if self._synthesize_response:
        response_str = self._service_context.llm.predict(
            self._response_synthesis_prompt,
            sql_query=sql_query,
            table_schema=table_schema,
            sql_response=results,
            query_str=query_bundle.query_str,
        )
        if self._verbose:
            print_text(f"Response: {response_str}", color="magenta")
    else:
        response_str = str(results)
    response_metadata = {"sql_query": sql_query, "table_schema": str(table_schema)}

    return Response(response=response_str, metadata=response_metadata)

By following these steps and utilizing the provided methods, you should be able to resolve the OperationalError and ensure that the SQL queries generated by the LLM are valid and executable ^[1]^[2]^[3].

To continue talking to Dosu, mention @dosu.

logan-markewich commented 1 week ago

Error executing query: sql so the LLM wrote the query as sql -- which is not a valid query or valid sql 😅

tpanza commented 1 week ago

Error executing query: sql so the LLM wrote the query as sql -- which is not a valid query or valid sql 😅

I'm using gpt-4-0125-Preview, deployed in Azure, as the LLM.

Looks like the full query was:

sql
SELECT AVG(age) AS average_age FROM items;

So, somehow need to remove the wrongly inserted sql from the beginning

guptadhaval18 commented 3 days ago

Hi @logan-markewich, as the issue is happening for given GPT. I think as a solution, we can resolve the issue by modifying the given line response_synthesis_prompt: Optional[BasePromptTemplate] = None in JSONalyzeQueryEngine function with response_synthesis_prompt: Optional[Union[BasePromptTemplate,PromptTemplate]] = None. As far as I can see, the given datatype will be handled properly and give correct result on modified prompts. Doing this is the best case as down the line if someone else encounters some other error on different LLM they can modify the prompt.

Or can create a separate argument to pass to the class for prompt modification as this will not require remembering to pass the text variables which are already mentioned in DEFAULT_JSONALYZE_PROMPT

Let me know if this seems logical, will create merge request

run-llama / llama_index