run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Documentation]: Document the pipeline used by NLSQLTableQueryEngine (and potentially other query engines) #14050

Open szhydkov opened 3 weeks ago

szhydkov commented 3 weeks ago

Documentation Issue Description

The original question in the discussion remains unanswered, so I'm not sure whether it is something obvious or a genuine gap in the documentation that needs to be filled.

I have gone through the docs, examples, and source code, and it is still not obvious to me what exactly happens when you query() a NLSQLTableQueryEngine. From what I gathered, an SQL retriever builds a prompt from a predefined template with the table schema attached as context, sends it to the LLM, gets back an SQL query, runs that query against the SQL server, and attaches the result to the final prompt sent to the LLM. However, the only mention of the schema I can find seems to produce a huge, verbose schema that doesn't fit in the context window.

The whole process should be documented; this shouldn't take more than a few lines.

Many thanks!

Documentation Link

https://docs.llamaindex.ai/en/stable/api_reference/query_engine/NL_SQL_table/?h=nlsqltablequeryengine#llama_index.core.query_engine.NLSQLTableQueryEngine

dosubot[bot] commented 3 weeks ago

Hey @szhydkov! I'm here to help you with any bugs, questions, or contributions you have. Let's tackle this issue together!

The NLSQLTableQueryEngine in the Llama Index API uses a pipeline to process natural language queries into SQL queries and then execute them. Here is a detailed explanation of the process that occurs when the query() method is called:

  1. Initialization:

    • The NLSQLTableQueryEngine is initialized with various parameters including the SQL database, prompt templates, context query arguments, and flags for response synthesis and verbosity.
    • An instance of NLSQLRetriever is created and assigned to _sql_retriever.
  2. Query Execution:

    • When the query() method is called, it internally calls the _query() method of the BaseSQLTableQueryEngine class.
    • The _query() method first retrieves the SQL query and metadata by calling retrieve_with_metadata() on the sql_retriever.
  3. SQL Query Retrieval:

    • The retrieve_with_metadata() method of NLSQLRetriever processes the natural language query to generate an SQL query.
    • The SQL query is then executed against the SQL database, and the results are retrieved.
  4. Response Synthesis:

    • If synthesize_response is set to True, the _query() method synthesizes a response using the response_synthesis_prompt.
    • The get_response_synthesizer() function is used to create a response synthesizer, which formats the SQL query and its results into a coherent response.
    • The synthesized response is then returned as the final output.
  5. Return Response:

    • If synthesize_response is False, the raw SQL query results are returned directly.
    • The final response, either synthesized or raw, is encapsulated in a Response object and returned.
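The five steps above can be sketched end-to-end with the standard library alone, using sqlite3 and a stubbed LLM callable in place of the LlamaIndex internals (all helper names here are illustrative, not the library's API):

```python
import sqlite3

def get_schema(conn):
    """Collect CREATE TABLE statements -- the 'schema context' (step 3)."""
    rows = conn.execute("SELECT sql FROM sqlite_master WHERE type = 'table'").fetchall()
    return "\n".join(r[0] for r in rows)

def query(conn, nl_question, llm, synthesize_response=True):
    # Steps 1-3: build the text-to-SQL prompt from the schema, ask the
    # LLM for a SQL query, then execute it against the database.
    sql_prompt = f"Schema:\n{get_schema(conn)}\nQuestion: {nl_question}\nSQLQuery: "
    sql_query = llm(sql_prompt)
    result = conn.execute(sql_query).fetchall()
    # Steps 4-5: either synthesize a natural-language answer from the
    # SQL result, or return the raw rows directly.
    if synthesize_response:
        answer_prompt = f"SQL: {sql_query}\nResult: {result}\nAnswer the question: {nl_question}"
        return llm(answer_prompt)
    return result

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, population INTEGER)")
conn.executemany("INSERT INTO city VALUES (?, ?)", [("Berlin", 3645000), ("Paris", 2161000)])

def stub_llm(prompt):
    """Stand-in LLM: returns a canned SQL query, then a canned answer."""
    if prompt.endswith("SQLQuery: "):
        return "SELECT COUNT(*) FROM city"
    return "There are 2 entries."

print(query(conn, "How many entries are there?", stub_llm))
```

With `synthesize_response=False` the same function returns the raw rows instead of the synthesized sentence, mirroring step 5.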

Here is the relevant code snippet illustrating this process:

class NLSQLTableQueryEngine(BaseSQLTableQueryEngine):
    def __init__(
        self,
        sql_database: SQLDatabase,
        text_to_sql_prompt: Optional[BasePromptTemplate] = None,
        context_query_kwargs: Optional[dict] = None,
        synthesize_response: bool = True,
        response_synthesis_prompt: Optional[BasePromptTemplate] = None,
        tables: Optional[Union[List[str], List[Table]]] = None,
        service_context: Optional[ServiceContext] = None,
        context_str_prefix: Optional[str] = None,
        sql_only: bool = False,
        verbose: bool = False,
        **kwargs: Any,
    ) -> None:
        self._sql_retriever = NLSQLRetriever(
            sql_database,
            text_to_sql_prompt=text_to_sql_prompt,
            context_query_kwargs=context_query_kwargs,
            tables=tables,
            context_str_prefix=context_str_prefix,
            service_context=service_context,
            sql_only=sql_only,
            verbose=verbose,
        )
        super().__init__(
            synthesize_response=synthesize_response,
            response_synthesis_prompt=response_synthesis_prompt,
            service_context=service_context,
            verbose=verbose,
            **kwargs,
        )

    @property
    def sql_retriever(self) -> NLSQLRetriever:
        return self._sql_retriever

class BaseSQLTableQueryEngine(BaseQueryEngine):
    def _query(self, query_bundle: QueryBundle) -> Response:
        retrieved_nodes, metadata = self.sql_retriever.retrieve_with_metadata(query_bundle)
        sql_query_str = metadata["sql_query"]
        if self._synthesize_response:
            partial_synthesis_prompt = self._response_synthesis_prompt.partial_format(
                sql_query=sql_query_str
            )
            response_synthesizer = get_response_synthesizer(
                llm=self._llm,
                callback_manager=self.callback_manager,
                text_qa_template=partial_synthesis_prompt,
                refine_template=self._refine_synthesis_prompt,
                verbose=self._verbose,
                streaming=self._streaming,
            )
            response = response_synthesizer.synthesize(query=query_bundle.query_str, nodes=retrieved_nodes)
            cast(Dict, response.metadata).update(metadata)
            return cast(Response, response)
        else:
            response_str = "\n".join([node.node.text for node in retrieved_nodes])
            return Response(response=response_str, metadata=metadata)

Additionally, the Llama Index documentation provides an advanced guide on setting up a text-to-SQL pipeline, which includes:

  1. Query-Time Table Retrieval: Dynamically retrieve relevant tables in the text-to-SQL prompt.
  2. Query-Time Sample Row Retrieval: Embed/Index each row and dynamically retrieve example rows for each table in the text-to-SQL prompt.

The guide demonstrates how to load and ingest data, define core modules (Object index, retriever, SQLDatabase, Text-to-SQL Prompt, Response synthesis Prompt, LLM), and set up an end-to-end text-to-SQL pipeline with table retrieval and row retrieval capabilities. It also includes code examples for indexing tables, defining query pipelines, and running queries.
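The "query-time table retrieval" idea can be illustrated without LlamaIndex at all: score each table's DDL against the question and keep only the top-k schemas for the text-to-SQL prompt. This toy version uses word overlap where the guide uses embedding-based retrieval (the helper name is made up for illustration):

```python
def retrieve_tables(schemas, question, k=2):
    """Rank table DDL strings by word overlap with the question, keep top-k."""
    q_words = set(question.lower().split())
    def score(item):
        _, ddl = item
        ddl_words = set(
            ddl.lower().replace("(", " ").replace(")", " ").replace(",", " ").split()
        )
        return len(q_words & ddl_words)
    ranked = sorted(schemas.items(), key=score, reverse=True)
    return [name for name, _ in ranked[:k]]

schemas = {
    "album": "CREATE TABLE album (id, title, artist_id)",
    "artist": "CREATE TABLE artist (id, name)",
    "invoice": "CREATE TABLE invoice (id, total, customer_id)",
}
print(retrieve_tables(schemas, "Which artist recorded the album Thriller?", k=2))
```

Only the retrieved schemas are then pasted into the text-to-SQL prompt, which is what keeps the prompt size bounded even for databases with many tables.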

For more detailed information, you can refer to the Llama Index documentation [1][2][3].

To continue talking to Dosu, mention @dosu.

szhydkov commented 3 weeks ago

Thanks Dosu, but this still doesn't fully answer my question. Consider code like this:

# engine is a mysql connection
sql_database = SQLDatabase(engine)
# llm is a groq llama3-70b model
query_engine = NLSQLTableQueryEngine(sql_database=sql_database, llm=llm) 
response = query_engine.query("How many entries are there in the data?")

How do I see exactly what prompt is being sent to generate the SQL query? When logging is set to INFO, the schema is printed, but it has 50k tokens, so I don't understand how that can be sent to the model provider (Groq).

dosubot[bot] commented 3 weeks ago

To see the exact prompt being sent to generate the SQL query when using the NLSQLTableQueryEngine, you can enable verbose logging. This will log the details of the prompt, including the SQL query generated from the natural language input. Here is how you can do it:

from llama_index.core.utilities.sql_wrapper import SQLDatabase
from llama_index.core.prompts.default_prompts import DEFAULT_TEXT_TO_SQL_PROMPT
from llama_index.core.indices.struct_store.sql_query import NLSQLTableQueryEngine
# Requires the llama-index-llms-groq integration package
from llama_index.llms.groq import Groq

# Initialize your SQLDatabase and LLM instances
sql_database = SQLDatabase(engine)
llm = Groq(model="llama3-70b-8192")

# Initialize the NLSQLTableQueryEngine with verbose logging
query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    llm=llm,
    text_to_sql_prompt=DEFAULT_TEXT_TO_SQL_PROMPT,
    verbose=True
)

# Now, when you run a query, the prompt details will be logged
response = query_engine.query("How many entries are there in the data?")

By setting verbose=True, the NLSQLTableQueryEngine will log the details of the prompt, including the SQL query generated from the natural language input [1].

Regarding the large schema, the NLSQLTableQueryEngine lets you restrict the context to a subset of tables via the tables parameter. Only the schemas of the listed tables are included in the prompt, which keeps the schema portion small. Here is an example:

# Specify a subset of tables to include in the context
tables = ["table1", "table2"]

# Initialize the NLSQLTableQueryEngine with the specified tables
query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    llm=llm,
    text_to_sql_prompt=DEFAULT_TEXT_TO_SQL_PROMPT,
    tables=tables,
    verbose=True
)

# Run the query
response = query_engine.query("How many entries are there in the data?")

This way, you can manage the schema size and ensure that the prompt fits within the context window of the model provider [2][1].
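Why restricting tables shrinks the prompt can be seen with a plain sqlite3 sketch that builds the schema context from database metadata (`schema_for` is a hypothetical helper, not LlamaIndex code):

```python
import sqlite3

def schema_for(conn, tables=None):
    """Join the CREATE TABLE DDL for all tables, or only the named subset."""
    rows = conn.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'")
    return "\n".join(sql for name, sql in rows if tables is None or name in tables)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, email TEXT)")
conn.execute("CREATE TABLE audit_log (id INTEGER, payload TEXT, created_at TEXT)")

full = schema_for(conn)                       # all three tables' DDL
subset = schema_for(conn, tables=["orders"])  # only the tables you listed
print(len(subset) < len(full))
```

The schema string the engine sends to the LLM shrinks in exactly this way when you pass a `tables` subset.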


szhydkov commented 3 weeks ago

I couldn't get it to work with verbose=True, but setting the logging level to DEBUG helps: it logs the request sent to the provider API. This reveals that the full schema is around 5k tokens, so it fits in the 8k-token window. If I still wanted to reduce the prompt size by using a more concise representation of the schema, how should I do that?
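For reference, the DEBUG-versus-INFO behavior described here can be reproduced with the standard library alone; the logger name below is illustrative, while in practice it is the HTTP client or provider SDK logger that emits the request payload:

```python
import io
import logging

# Capture log output in a buffer so we can inspect it.
buf = io.StringIO()
log = logging.getLogger("provider.request")  # illustrative logger name
log.addHandler(logging.StreamHandler(buf))

log.setLevel(logging.INFO)
log.debug("request payload: %s", {"prompt": "..."})  # suppressed at INFO

log.setLevel(logging.DEBUG)
log.debug("request payload: %s", {"prompt": "..."})  # emitted at DEBUG

print("request payload" in buf.getvalue())
```

This matches the observation above: the outgoing request only appears once the effective level is lowered to DEBUG.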

dosubot[bot] commented 3 weeks ago

To reduce the prompt size by using a more concise representation of the schema, you can focus on the essential parts of the schema and remove redundant or less critical information. Here is a more concise representation of the schema:

from typing import Any, Dict, List, Optional
from enum import Enum, auto
from hashlib import sha256
import uuid
import json

class BaseComponent:
    @classmethod
    def class_name(cls) -> str:
        return "base_component"

    def to_dict(self, **kwargs: Any) -> Dict[str, Any]:
        data = self.dict(**kwargs)
        data["class_name"] = self.class_name()
        return data

    def to_json(self, **kwargs: Any) -> str:
        return json.dumps(self.to_dict(**kwargs))

    @classmethod
    def from_dict(cls, data: Dict[str, Any], **kwargs: Any) -> "BaseComponent":
        data.pop("class_name", None)
        return cls(**data)

    @classmethod
    def from_json(cls, data_str: str, **kwargs: Any) -> "BaseComponent":
        return cls.from_dict(json.loads(data_str), **kwargs)

class ObjectType(str, Enum):
    TEXT = auto()
    IMAGE = auto()
    INDEX = auto()
    DOCUMENT = auto()

class MetadataMode(str, Enum):
    ALL = "all"
    EMBED = "embed"
    LLM = "llm"
    NONE = "none"

class BaseNode(BaseComponent):
    id_: str = str(uuid.uuid4())
    embedding: Optional[List[float]] = None
    metadata: Dict[str, Any] = {}
    relationships: Dict[str, Any] = {}

    @classmethod
    def get_type(cls) -> str:
        return "base_node"

    def get_content(self, metadata_mode: MetadataMode = MetadataMode.ALL) -> str:
        return self.metadata.get("content", "")

    def get_metadata_str(self, mode: MetadataMode = MetadataMode.ALL) -> str:
        return json.dumps(self.metadata)

    @property
    def hash(self) -> str:
        return sha256((self.get_content() + str(self.metadata)).encode()).hexdigest()

class TextNode(BaseNode):
    text: str = ""

    @classmethod
    def get_type(cls) -> str:
        return ObjectType.TEXT

    def get_content(self, metadata_mode: MetadataMode = MetadataMode.NONE) -> str:
        return self.text

class Document(TextNode):
    id_: str = str(uuid.uuid4())

    @classmethod
    def get_type(cls) -> str:
        return ObjectType.DOCUMENT

    @classmethod
    def example(cls) -> "Document":
        return Document(text="Sample text", metadata={"filename": "README.md", "category": "codebase"})

    @classmethod
    def class_name(cls) -> str:
        return "Document"

This concise representation retains the core functionality and structure of the schema while omitting less critical details. Additionally, you can use prompt templates to further streamline the schema representation, as shown in the LlamaIndex example:

from llama_index.core.prompts.base import PromptTemplate
from llama_index.core.prompts.prompt_type import PromptType

DEFAULT_SUMMARY_PROMPT_TMPL = (
    "Write a summary of the following. Try to use only the "
    "information provided. "
    "Try to include as many key details as possible.\n"
    "\n"
    "\n"
    "{context_str}\n"
    "\n"
    "\n"
    'SUMMARY:"""\n'
)

DEFAULT_SUMMARY_PROMPT = PromptTemplate(
    DEFAULT_SUMMARY_PROMPT_TMPL, prompt_type=PromptType.SUMMARY
)

DEFAULT_INSERT_PROMPT_TMPL = (
    "Context information is below. It is provided in a numbered list "
    "(1 to {num_chunks}), "
    "where each item in the list corresponds to a summary.\n"
    "---------------------\n"
    "{context_list}"
    "---------------------\n"
    "Given the context information, here is a new piece of "
    "information: {new_chunk_text}\n"
    "Answer with the number corresponding to the summary that should be updated. "
    "The answer should be the number corresponding to the "
    "summary that is most relevant to the question.\n"
)
DEFAULT_INSERT_PROMPT = PromptTemplate(
    DEFAULT_INSERT_PROMPT_TMPL, prompt_type=PromptType.TREE_INSERT
)

DEFAULT_QUERY_PROMPT_TMPL = (
    "Some choices are given below. It is provided in a numbered list "
    "(1 to {num_chunks}), "
    "where each item in the list corresponds to a summary.\n"
    "---------------------\n"
    "{context_list}"
    "\n---------------------\n"
    "Using only the choices above and not prior knowledge, return "
    "the choice that is most relevant to the question: '{query_str}'\n"
    "Provide choice in the following format: 'ANSWER: <number>' and explain why "
    "this summary was selected in relation to the question.\n"
)
DEFAULT_QUERY_PROMPT = PromptTemplate(
    DEFAULT_QUERY_PROMPT_TMPL, prompt_type=PromptType.TREE_SELECT
)

DEFAULT_REFINE_PROMPT_TMPL = (
    "The original query is as follows: {query_str}\n"
    "We have provided an existing answer: {existing_answer}\n"
    "We have the opportunity to refine the existing answer "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{context_msg}\n"
    "------------\n"
    "Given the new context, refine the original answer to better "
    "answer the query. "
    "If the context isn't useful, return the original answer.\n"
    "Refined Answer: "
)
DEFAULT_REFINE_PROMPT = PromptTemplate(
    DEFAULT_REFINE_PROMPT_TMPL, prompt_type=PromptType.REFINE
)

DEFAULT_TEXT_QA_PROMPT_TMPL = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
DEFAULT_TEXT_QA_PROMPT = PromptTemplate(
    DEFAULT_TEXT_QA_PROMPT_TMPL, prompt_type=PromptType.QUESTION_ANSWER
)

DEFAULT_TEXT_TO_SQL_PGVECTOR_TMPL = """\
Given an input question, first create a syntactically correct {dialect} \
query to run, then look at the results of the query and return the answer. \
You can order the results by a relevant column to return the most \
interesting examples in the database.

Pay attention to use only the column names that you can see in the schema \
description. Be careful to not query for columns that do not exist. \
Pay attention to which column is in which table. Also, qualify column names \
with the table name when needed.

IMPORTANT NOTE: you can use specialized pgvector syntax (`<->`) to do nearest \
neighbors/semantic search to a given vector from an embeddings column in the table. \
The embeddings value for a given row typically represents the semantic meaning of that row. \
The vector represents an embedding representation \
of the question, given below. Do NOT fill in the vector values directly, but rather specify a \
`[query_vector]` placeholder. For instance, some select statement examples below \
(the name of the embeddings column is `embedding`):
SELECT * FROM items ORDER BY embedding <-> '[query_vector]' LIMIT 5;
SELECT * FROM items WHERE id != 1 ORDER BY embedding <-> (SELECT embedding FROM items WHERE id = 1) LIMIT 5;
SELECT * FROM items WHERE embedding <-> '[query_vector]' < 5;

You are required to use the following format, \
each taking one line:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use tables listed below.
{schema}

Question: {query_str}
SQLQuery: \
"""

DEFAULT_TEXT_TO_SQL_PGVECTOR_PROMPT = PromptTemplate(
    DEFAULT_TEXT_TO_SQL_PGVECTOR_TMPL,
    prompt_type=PromptType.TEXT_TO_SQL,
)

DEFAULT_TABLE_CONTEXT_TMPL = (
    "We have provided a table schema below. "
    "---------------------\n"
    "{schema}\n"
    "---------------------\n"
    "We have also provided context information below. "
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and the table schema, "
    "give a response to the following task: {query_str}"
)

DEFAULT_TABLE_CONTEXT_PROMPT = PromptTemplate(
    DEFAULT_TABLE_CONTEXT_TMPL, prompt_type=PromptType.TABLE_CONTEXT
)

DEFAULT_REFINE_TABLE_CONTEXT_TMPL = (
    "We have provided a table schema below. "
    "---------------------\n"
    "{schema}\n"
    "---------------------\n"
    "We have also provided some context information below. "
    "{context_msg}\n"
    "---------------------\n"
    "Given the context information and the table schema, "
    "give a response to the following task: {query_str}\n"
    "We have provided an existing answer: {existing_answer}\n"
    "Given the new context, refine the original answer to better "
    "answer the question. "
    "If the context isn't useful, return the original answer."
)
DEFAULT_REFINE_TABLE_CONTEXT_PROMPT = PromptTemplate(
    DEFAULT_REFINE_TABLE_CONTEXT_TMPL, prompt_type=PromptType.TABLE_CONTEXT
)

DEFAULT_KG_TRIPLET_EXTRACT_TMPL = (
    "Some text is provided below. Given the text, extract up to "
    "{max_knowledge_triplets} "
    "knowledge triplets in the form of (subject, predicate, object). Avoid stopwords.\n"
    "---------------------\n"
    "Example:"
    "Text: Alice is Bob's mother."
    "Triplets:\n(Alice, is mother of, Bob)\n"
    "Text: Philz is a coffee shop founded in Berkeley in 1982.\n"
    "Triplets:\n"
    "(Philz, is, coffee shop)\n"
    "(Philz, founded in, Berkeley)\n"
    "(Philz, founded in, 1982)\n"
    "---------------------\n"
    "Text: {text}\n"
    "Triplets:\n"
)
DEFAULT_KG_TRIPLET_EXTRACT_PROMPT = PromptTemplate(
    DEFAULT_KG_TRIPLET_EXTRACT_TMPL,
    prompt_type=PromptType.KNOWLEDGE_TRIPLET_EXTRACT,
)

DEFAULT_HYDE_PROMPT = PromptTemplate(
    "Please write a passage to answer the question\n"
    "Try to include as many key details as possible.\n"
    "\n"
    "\n"
    "{context_str}\n"
    "\n"
    "\n"
    'Passage:"""\n', prompt_type=PromptType.SUMMARY
)

DEFAULT_SIMPLE_INPUT_PROMPT = PromptTemplate(
    "{query_str}", prompt_type=PromptType.SIMPLE_INPUT
)

DEFAULT_JSON_PATH_PROMPT = PromptTemplate(
    "We have provided a JSON schema below:\n"
    "{schema}\n"
    "Given a task, respond with a JSON Path query that "
    "can retrieve data from a JSON value that matches the schema.\n"
    "Provide the JSON Path query in the following format: 'JSONPath: <JSONPath>'\n"
    "You must include the value 'JSONPath:' before the provided JSON Path query."
    "Example Format:\n"
    "Task: What is John's age?\n"
    "Response: JSONPath: $.John.age\n"
    "Let's try this now: \n\n"
    "Task: {query_str}\n"
    "Response: ", prompt_type=PromptType.JSON_PATH
)

DEFAULT_CHOICE_SELECT_PROMPT = PromptTemplate(
    "A list of documents is shown below. Each document has a number next to it along "
    "with a summary of the document. A question is also provided. \n"
    "Respond with the numbers of the documents "
    "you should consult to answer the question, in order of relevance, as well \n"
    "as the relevance score. The relevance score is a number from 1-10 based on "
    "how relevant you think the document is to the question.\n"
    "Do not include any documents that are not relevant to the question. \n"
    "Example format: \n"
    "Document 1:\n<summary of document 1>\n\n"
    "Document 2:\n<summary of document 2>\n\n"
    "...\n\n"
    "Document 10:\n<summary of document 10>\n\n"
    "Question: <question>\n"
    "Answer:\n"
    "Doc: 9, Relevance: 7\n"
    "Doc: 3, Relevance: 4\n"
    "Doc: 7, Relevance: 3\n\n"
    "Let's try this now: \n\n"
    "{context_str}\n"
    "Question: {query_str}\n"
    "Answer:\n", prompt_type=PromptType.CHOICE_SELECT
)

RANKGPT_RERANK_PROMPT = PromptTemplate(
    "Search Query: {query}. \nRank the {num} passages above "
    "based on their relevance to the search query. The passages "
    "should be listed in descending order using identifiers. "
    "The most relevant passages should be listed first. "
    "The output format should be [] > [], e.g., [1] > [2]. "
    "Only response the ranking results, "
    "do not say any word or explain.", prompt_type=PromptType.RANKGPT_RERANK
)

DEFAULT_JSONALYZE_PROMPT = PromptTemplate(
    "You are given a table named: '{table_name}' with schema, "
    "generate SQLite SQL query to answer the given question.\n"
    "Table schema:\n"
    "{table_schema}\n"
    "Question: {question}\n\n"
    "SQLQuery: ", prompt_type=PromptType.TEXT_TO_SQL
)

Using prompt templates helps in maintaining a concise and organized schema representation [1][2][3].
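As a more direct answer to the original question, one option is to build a compact one-line-per-table schema string yourself (here from sqlite3's PRAGMA table_info; a MySQL information_schema query would play the same role) and supply it as extra context, for example via the context_str_prefix parameter shown in the constructor earlier. compact_schema is a hypothetical helper, not part of LlamaIndex:

```python
import sqlite3

def compact_schema(conn):
    """Render each table as a single 'name(col type, ...)' line."""
    lines = []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )]
    for t in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f"PRAGMA table_info({t})").fetchall()
        col_str = ", ".join(f"{c[1]} {c[2]}" for c in cols)
        lines.append(f"{t}({col_str})")
    return "\n".join(lines)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT)")
print(compact_schema(conn))
```

A representation like this drops constraints, defaults, and comments, so it is a trade-off: it can cut the schema token count substantially, at the risk of the LLM missing details that the full DDL would have conveyed.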
