run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Pandas Query Engine does not give proper response (especially when exposed via API; works well in local Jupyter notebook) #14804

Open Sridevi17j opened 1 month ago

Sridevi17j commented 1 month ago

Question Validation

Question

Below is my code

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pandas as pd
from IPython.display import Markdown, display
import time

from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.experimental.query_engine import PandasQueryEngine

app = FastAPI()

# Load the CSV file
df = pd.read_csv("sample.csv")

# Azure OpenAI LLM setup (details omitted here)
llm = AzureOpenAI(...)

# Initialize the PandasQueryEngine
query_engine = PandasQueryEngine(df=df, llm=llm, verbose=True, synthesize_response=True)

class Query(BaseModel):
    query: str

@app.post("/query")
async def process_query(query: Query):
    try:
        response = query_engine.query(query.query)
        return {"response": response.response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=7000)

I am using Azure OpenAI for the LLM. The same code, without the API, works very well; when I expose it via the API, the response looks like this:

{
    "query": "Give me the number of users in the document"
}
{
    "response": "The number of unique users in the document is [insert output from running the Pandas instruction]."
}

dosubot[bot] commented 1 month ago

To resolve the issue where the placeholder [insert output from running the Pandas instruction] is not being replaced with the actual output when querying the Pandas Query Engine via a FastAPI endpoint, you need to ensure that the PandasInstructionParser function is correctly processing the output and that the PandasQueryEngine is being used properly.

Here is a code snippet that demonstrates how to use the PandasQueryEngine to parse and execute Pandas instructions:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pandas as pd
from llama_index.experimental.query_engine.pandas.output_parser import PandasInstructionParser
from llama_index.experimental.query_engine.pandas.pandas_query_engine import PandasQueryEngine
from llama_index.core.schema import QueryBundle  # needed for the QueryBundle used below

app = FastAPI()

# Load the CSV file
df = pd.read_csv("sample.csv")

# Initialize the PandasQueryEngine with the DataFrame
query_engine = PandasQueryEngine(df)
instruction_parser = PandasInstructionParser(df)

class Query(BaseModel):
    query: str

@app.post("/query")
async def process_query(query: Query):
    try:
        # Create a QueryBundle with the instruction
        query_bundle = QueryBundle(query_str=query.query)

        # Execute the query using the PandasQueryEngine
        response = query_engine._query(query_bundle)

        # Parse the output using PandasInstructionParser
        parsed_output = instruction_parser.parse(response.response)

        return {"response": parsed_output}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=7000)

In this example, the FastAPI endpoint /query accepts a POST request with a Pandas instruction as input. The PandasQueryEngine is used to parse and execute the instruction, and the result is processed by the PandasInstructionParser to ensure the placeholder is replaced with the actual output. Make sure that the query string contains valid Pandas code that can be executed on the provided DataFrame df [1][2][3].
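
For reference, the endpoint above can be exercised from a separate process. A minimal client sketch (assuming the server is running locally on port 7000 and the requests package is installed):

import requests

# Smoke test for the /query endpoint defined above.
resp = requests.post(
    "http://localhost:7000/query",
    json={"query": "Give me the number of users in the document"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["response"])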

To continue talking to Dosu, mention @dosu.

Sridevi17j commented 1 month ago

@dosu, we tried multiple ways; it does not work via the API or from a single Python file, but the same code works in a Jupyter notebook. I wonder why?

dosubot[bot] commented 1 month ago

The discrepancy you're experiencing is likely due to the way the PandasQueryEngine and PandasInstructionParser are being used in different environments. Here are a few key points to consider:

  1. Private Method Usage: Ensure you are not using private methods (those prefixed with an underscore) directly. Instead, use the public methods provided by the library.

  2. Async Execution: If you are using asynchronous methods, ensure they are awaited properly.

  3. Correct Imports: Make sure all necessary classes and functions are correctly imported.

  4. Security Considerations: Be aware that using eval in a production environment poses significant security risks.

Here is a revised version of your code that addresses these points:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pandas as pd
from llama_index.experimental.query_engine.pandas.output_parser import PandasInstructionParser
from llama_index.experimental.query_engine.pandas.pandas_query_engine import PandasQueryEngine
from llama_index.core.schema import QueryBundle  # Ensure this import is correct

app = FastAPI()

# Load the CSV file
df = pd.read_csv("sample.csv")

# Initialize the PandasQueryEngine with the DataFrame
query_engine = PandasQueryEngine(df)
instruction_parser = PandasInstructionParser(df)

class Query(BaseModel):
    query: str

@app.post("/query")
async def process_query(query: Query):
    try:
        # Create a QueryBundle with the instruction
        query_bundle = QueryBundle(query_str=query.query)

        # Execute the query using the PandasQueryEngine
        response = await query_engine._aquery(query_bundle)  # Use the async method

        # Parse the output using PandasInstructionParser
        parsed_output = instruction_parser.parse(response.response)

        return {"response": parsed_output}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=7000)

Ensure that your Python version is 3.9 or higher, as the default_output_processor function in PandasInstructionParser relies on features available in Python 3.9 and above [1][2][3].

To continue talking to Dosu, mention @dosu.

Sridevi17j commented 1 month ago

@dosu, this is my question: "What was the first activity?"

This question works in Jupyter and returns the first activity name, because the engine interprets the question well and looks up the column named "Operation Name". But when I execute the same code as a Python API, it looks for a column named "Activity" and raises a KeyError, which results in "The first activity cannot be determined as there is no information provided in the query results."

Your given code also does not work via the API. What is happening?

dosubot[bot] commented 1 month ago

The discrepancy where the Pandas Query Engine works in a Jupyter notebook but fails in a FastAPI endpoint could be due to differences in the environment or the way the code is executed. One common issue is the version of Python being used. The provided tests indicate that certain functionalities require Python 3.9 or higher.

Additionally, it appears that the PandasQueryEngine has been deprecated and moved to llama-index-experimental. This could be another reason for the failure in the FastAPI endpoint if the deprecated version is being used.

To resolve this issue, ensure that the FastAPI environment is using Python 3.9 or higher. You can check the Python version in your FastAPI environment and update it if necessary. Also, make sure to use the updated PandasQueryEngine from llama-index-experimental.

Here is a snippet to check the Python version in your FastAPI endpoint:

import sys
from fastapi import FastAPI

app = FastAPI()

@app.get("/check-python-version")
def check_python_version():
    return {"python_version": sys.version}

If the version is lower than 3.9, you will need to update your Python environment to 3.9 or higher.

To use the updated PandasQueryEngine, install the llama-index-experimental package and import the engine from there:

pip install llama-index-experimental

Then, update your import statements in the FastAPI endpoint:

from llama_index.experimental.query_engine.pandas import PandasQueryEngine

By ensuring the correct Python version and using the updated PandasQueryEngine, you should be able to resolve the discrepancy. Note that the PandasQueryEngine provides the Agent access to the eval function, which allows for arbitrary code execution. This tool is not recommended for use in a production setting without heavy sandboxing or virtual machines [1][2][3].

To continue talking to Dosu, mention @dosu.

Sridevi17j commented 1 month ago

@dosu, the environment is the same for both Jupyter and the FastAPI Python file (both use the same Python version). What I understand is that when executed in Jupyter, the prompt is interpreted into the right query: "activity" is mapped to the correct column name "Operation Name". But in the Python file, "activity" is interpreted literally, so the engine searches the dataframe for a column named "Activity" and fails. If I give the exact column name, such as "Operation Name", then it works in the API Python file as well. Why is that?

dosubot[bot] commented 1 month ago

The difference in prompt interpretation between Jupyter and the FastAPI endpoint is likely due to the handling of the prompt formatting. In the SageMakerLLM class, the complete method formats the prompt using the _completion_to_prompt function if the formatted parameter is set to False.

In Jupyter, you might be directly passing the prompt without additional formatting, while the FastAPI endpoint might be invoking the complete method with formatted set to False, causing the prompt to be reformatted.

To resolve this, ensure that the formatted parameter is consistently set to True when calling the complete method if you do not want the prompt to be reformatted. Here is the relevant part of the code:

@llm_completion_callback()
def complete(
    self, prompt: str, formatted: bool = False, **kwargs: Any
) -> CompletionResponse:
    model_kwargs = {**self.model_kwargs, **kwargs}
    if not formatted:
        prompt = self._completion_to_prompt(prompt, self.system_prompt)

    request_body = self.content_handler.serialize_input(prompt, model_kwargs)
    response = self._client.invoke_endpoint(
        EndpointName=self.endpoint_name,
        Body=request_body,
        ContentType=self.content_handler.content_type,
        Accept=self.content_handler.accept,
        **self.endpoint_kwargs,
    )

    response["Body"] = self.content_handler.deserialize_output(response["Body"])
    text = self.content_handler.remove_prefix(response["Body"], prompt)

    return CompletionResponse(
        text=text,
        raw=response,
        additional_kwargs={
            "model_kwargs": model_kwargs,
            "endpoint_kwargs": self.endpoint_kwargs,
        },
    )

Ensure that when you call the complete method, you set formatted=True if you want to avoid additional prompt formatting:

response = sage_maker_llm_instance.complete(prompt, formatted=True)

This should help maintain consistency in prompt interpretation between Jupyter and the FastAPI endpoint [1].

To continue talking to Dosu, mention @dosu.

Sridevi17j commented 1 month ago

@dosu, could you give me the complete code for the Pandas Query Engine that should work with FastAPI, with the prompt formatting you just mentioned above?

dosubot[bot] commented 1 month ago

Here is the complete code for the Pandas Query Engine that should work with FastAPI, including the correct prompt formatting:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pandas as pd
from llama_index.experimental.query_engine.pandas import PandasQueryEngine
from llama_index.experimental.query_engine.pandas.output_parser import PandasInstructionParser
from llama_index.core.schema import QueryBundle
from llama_index.legacy.prompts import PromptTemplate

app = FastAPI()

# Load the CSV file
df = pd.read_csv("sample.csv")

# Initialize PandasQueryEngine with the DataFrame
query_engine = PandasQueryEngine(df=df, verbose=True)

# Update prompts if needed
new_prompt = PromptTemplate(
    """\
You are working with a pandas dataframe in Python.
The name of the dataframe is `df`.
This is the result of `print(df.head())`:
{df_str}

Follow these instructions:
{instruction_str}
Query: {query_str}

Expression: """
)
query_engine._update_prompts({"pandas_prompt": new_prompt})

class Query(BaseModel):
    query: str

@app.post("/query")
async def process_query(query: Query):
    try:
        # Create a QueryBundle with the instruction
        query_bundle = QueryBundle(query_str=query.query)

        # Execute the query using the PandasQueryEngine
        response = await query_engine._aquery(query_bundle)  # Use the async method

        # Parse the output using PandasInstructionParser
        instruction_parser = PandasInstructionParser(df)
        parsed_output = instruction_parser.parse(response.response)

        return {"response": parsed_output}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=7000)

This code sets up a FastAPI application that uses the PandasQueryEngine to process queries on a DataFrame loaded from a CSV file. The query endpoint takes a query string as input and returns the response after processing it with the PandasInstructionParser. The prompt formatting is updated to ensure consistent interpretation of the query.

Make sure to install the necessary dependencies and use Python 3.9 or higher for compatibility.
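
Note that LlamaIndex query engines also expose a public async entry point, aquery(), which accepts a plain query string. As a sketch, the /query handler above could be written without touching the private _aquery method (this assumes the same app, Query model, and query_engine defined in the block above):

@app.post("/query")
async def process_query(query: Query):
    try:
        # Public async API; no QueryBundle or private methods needed.
        response = await query_engine.aquery(query.query)
        return {"response": response.response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))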

To continue talking to Dosu, mention @dosu.

Sridevi17j commented 1 month ago

@dosu, your code gives the below error:

{
    "detail": "1 validation error for LLMPredictStartEvent\ntemplate\n  Can't instantiate abstract class BasePromptTemplate without an implementation for abstract methods 'format', 'format_messages', 'get_template', 'partial_format' (type=type_error)"
}

dosubot[bot] commented 1 month ago

The error you're encountering indicates that the BasePromptTemplate class is being instantiated directly, which is not allowed because it contains abstract methods that need to be implemented.

To resolve this, ensure that you are using the correct PromptTemplate class from the llama_index.core.prompts module. Here is the corrected code:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pandas as pd
from llama_index.experimental.query_engine.pandas import PandasQueryEngine
from llama_index.experimental.query_engine.pandas.output_parser import PandasInstructionParser
from llama_index.core.schema import QueryBundle
from llama_index.core.prompts import PromptTemplate, PromptType

app = FastAPI()

# Load the CSV file
df = pd.read_csv("sample.csv")

# Initialize PandasQueryEngine with the DataFrame
query_engine = PandasQueryEngine(df=df, verbose=True)

# Define the default template string
DEFAULT_PANDAS_TMPL = (
    "You are working with a pandas dataframe in Python.\n"
    "The name of the dataframe is `df`.\n"
    "This is the result of `print(df.head())`:\n"
    "{df_str}\n\n"
    "Follow these instructions:\n"
    "{instruction_str}\n"
    "Query: {query_str}\n\n"
    "Expression:"
)

# Create the PromptTemplate instance
DEFAULT_PANDAS_PROMPT = PromptTemplate(
    DEFAULT_PANDAS_TMPL, prompt_type=PromptType.PANDAS
)

# Update the prompt template in the query engine
query_engine._update_prompts({"pandas_prompt": DEFAULT_PANDAS_PROMPT})

class Query(BaseModel):
    query: str

@app.post("/query")
async def process_query(query: Query):
    try:
        # Create a QueryBundle with the instruction
        query_bundle = QueryBundle(query_str=query.query)

        # Execute the query using the PandasQueryEngine
        response = await query_engine._aquery(query_bundle)  # Use the async method

        # Parse the output using PandasInstructionParser
        instruction_parser = PandasInstructionParser(df)
        parsed_output = instruction_parser.parse(response.response)

        return {"response": parsed_output}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=7000)

This code ensures that the PromptTemplate is correctly instantiated and used within the PandasQueryEngine. Make sure to replace "sample.csv" with the path to your actual CSV file. This should resolve the error related to the abstract class instantiation [1][2].
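
As a quick sanity check that the concrete PromptTemplate class can be instantiated (unlike the abstract BasePromptTemplate, which raises the validation error above), here is a minimal sketch with a made-up template and placeholder value:

from llama_index.core.prompts import PromptTemplate

# PromptTemplate implements format/get_template, so this runs without the
# "Can't instantiate abstract class BasePromptTemplate" validation error.
tmpl = PromptTemplate("Query: {query_str}\nExpression:")
print(tmpl.get_template())
print(tmpl.format(query_str="What was the first activity"))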

To continue talking to Dosu, mention @dosu.

Sridevi17j commented 1 month ago

@dosu, my prompt was "what was the first activity"

and I get the error below:

Pandas Instructions:

df['Activity'].iloc[0]

Traceback (most recent call last):
  File "C:\Users\sride\PycharmProjects\Test AI.venv\Lib\site-packages\pandas\core\indexes\base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Activity'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\sride\PycharmProjects\Test AI.venv\Lib\site-packages\llama_index\experimental\query_engine\pandas\output_parser.py", line 54, in default_output_processor
    output_str = str(safe_eval(module_end_str, global_vars, local_vars))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\sride\PycharmProjects\Test AI.venv\Lib\site-packages\llama_index\experimental\exec_utils.py", line 159, in safe_eval
    return eval(__source, _get_restricted_globals(__globals), __locals)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 1, in <module>
  File "C:\Users\sride\PycharmProjects\Test AI.venv\Lib\site-packages\pandas\core\frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\sride\PycharmProjects\Test AI.venv\Lib\site-packages\pandas\core\indexes\base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'Activity'

Pandas Output: There was an error running the output as Python code. Error message: 'Activity'

Traceback (most recent call last):
  File "C:\Users\sride\PycharmProjects\Test AI.venv\Lib\site-packages\llama_index\experimental\query_engine\pandas\output_parser.py", line 40, in default_output_processor
    tree = ast.parse(output)
           ^^^^^^^^^^^^^^^^^
  File "C:\Users\sride\AppData\Local\Programs\Python\Python312\Lib\ast.py", line 52, in parse
    return compile(source, filename, mode, flags,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<unknown>", line 1
    There was an error running the output as Python code. Error message: 'Activity'

I don't know why the same code runs well in Jupyter without the API.

dosubot[bot] commented 1 month ago

The issue you're encountering is due to the difference in how the prompt is interpreted in Jupyter versus FastAPI. In Jupyter, the environment might be more interactive and flexible, allowing for better interpretation of the prompt. In contrast, FastAPI runs in a more controlled environment, which might lead to differences in how prompts are processed.

To ensure consistent prompt interpretation across both environments, you need to make sure that the prompt templates and the execution context are consistent. Here is an example of how you can set up the prompt templates and execution context in FastAPI to match the Jupyter environment:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pandas as pd
from llama_index.experimental.query_engine.pandas import PandasQueryEngine
from llama_index.core.prompts import PromptTemplate
from llama_index.core.schema import QueryBundle

app = FastAPI()

# Load the CSV file
df = pd.read_csv("sample.csv")

# Define the default instruction string and response synthesis prompt template
DEFAULT_INSTRUCTION_STR = (
    "1. Convert the query to executable Python code using Pandas.\n"
    "2. The final line of code should be a Python expression that can be called with the `eval()` function.\n"
    "3. The code should represent a solution to the query.\n"
    "4. PRINT ONLY THE EXPRESSION.\n"
    "5. Do not quote the expression.\n"
)

DEFAULT_RESPONSE_SYNTHESIS_PROMPT_TMPL = (
    "Given an input question, synthesize a response from the query results.\n"
    "Query: {query_str}\n\n"
    "Pandas Instructions (optional):\n{pandas_instructions}\n\n"
    "Pandas Output: {pandas_output}\n\n"
    "Response: "
)

DEFAULT_RESPONSE_SYNTHESIS_PROMPT = PromptTemplate(
    DEFAULT_RESPONSE_SYNTHESIS_PROMPT_TMPL,
)

# Initialize the PandasQueryEngine with the same prompts and settings
query_engine = PandasQueryEngine(
    df=df,
    instruction_str=DEFAULT_INSTRUCTION_STR,
    response_synthesis_prompt=DEFAULT_RESPONSE_SYNTHESIS_PROMPT,
    verbose=True
)

class Query(BaseModel):
    query: str

@app.post("/query")
async def process_query(query: Query):
    try:
        # Create a QueryBundle with the instruction
        query_bundle = QueryBundle(query_str=query.query)

        # Execute the query using the PandasQueryEngine
        response = await query_engine._aquery(query_bundle)  # Use the async method

        return {"response": response.response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=7000)

By ensuring that the prompt templates (DEFAULT_INSTRUCTION_STR and DEFAULT_RESPONSE_SYNTHESIS_PROMPT) and the execution context are the same in both Jupyter and FastAPI, you can achieve consistent prompt interpretation across both environments [1].
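
One way to confirm that the two environments really do run with identical prompts is to dump the templates the engine will actually use and compare them between the notebook and the FastAPI process. A minimal sketch (assuming the query_engine built above; get_prompts() and get_template() are the public prompt-inspection methods):

# Print every prompt template attached to the engine so the Jupyter
# and FastAPI runs can be compared line by line.
for name, prompt in query_engine.get_prompts().items():
    print(name)
    print(prompt.get_template())
    print("-" * 60)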

To continue talking to Dosu, mention @dosu.

Sridevi17j commented 1 month ago

@dosu, no luck, still the same error

dosubot[bot] commented 1 month ago

@logan-markewich, it seems like this issue requires a deeper look, could you please lend your expertise?

To continue talking to Dosu, mention @dosu.

Sridevi17j commented 1 month ago

@logan-markewich , to explain the issue,

I'm encountering an issue with the Pandas query engine behaving differently in different environments. Let me explain.

The setup:

I have a simple pandas query engine code. My dataframe has a column named "Operation Name".

In Jupyter Notebook:

The code works perfectly. When I input the query "What was the first activity?", it correctly interprets "activity" as the "Operation Name" column. It searches the "Operation Name" column and returns the expected results.

However, in a .py file or when exposed as an API:

The same code doesn't work as expected. Using the same query "What was the first activity?", the query engine literally searches for an "Activity" column. Since there's no "Activity" column in my dataframe, it throws a KeyError. The error message is: "> Pandas Output: There was an error running the output as Python code. Error message: 'Activity'"

It seems the query engine's ability to map natural language to the correct column name is working in Jupyter, but failing in other environments. Do you have any insights on why this might be happening or how to resolve it?

Below is my code. I am using Azure OpenAI as the LLM (I am not giving the LLM details here, but I have declared them in my code).

from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.experimental.query_engine.pandas.pandas_query_engine import PandasQueryEngine
import pandas as pd

llm = AzureOpenAI()  # I have given the details in my code

df = pd.read_csv("sample.csv")

query_engine = PandasQueryEngine(df=df, llm=llm, verbose=True, synthesize_response=True)

response = query_engine.query("What was the first activity")
print(response.response)

Sridevi17j commented 1 month ago

@dosu, @logan-markewich

Can anyone help me with this issue, please?

dosubot[bot] commented 1 month ago

A maintainer has already been looped in.

To continue talking to Dosu, mention @dosu.

Sridevi17j commented 1 month ago

@logan-markewich is it possible to help me on this?

@dosu, is it possible to get any update?

dosubot[bot] commented 1 month ago

The issue you're experiencing, where the Pandas Query Engine works correctly in Jupyter but fails in a standalone Python script or API, is likely due to differences in how the environments handle the interpretation of natural language queries.

To address this, ensure that the column names used in your queries match exactly with the column names in your DataFrame. This includes checking for any leading or trailing spaces, case sensitivity, and special characters. Here is a step-by-step approach to debug and resolve the issue:

  1. Verify DataFrame Columns: Print the DataFrame columns to ensure they match what you expect.
  2. Explicit Column Reference: Use explicit column references in your queries to avoid ambiguity.

Here is an updated version of your code with additional debugging steps:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pandas as pd
from llama_index.experimental.query_engine.pandas import PandasQueryEngine
from llama_index.core.schema import QueryBundle
from llama_index.llms.azure_openai import AzureOpenAI

app = FastAPI()

# Load the CSV file
df = pd.read_csv("sample.csv")

# Print DataFrame columns for debugging
print("DataFrame columns:", df.columns)

# Initialize the LLM (ensure you have the correct details in your actual code)
llm = AzureOpenAI(api_key="your_api_key", endpoint="your_endpoint", api_version="2023-05-15")

# Initialize PandasQueryEngine with the DataFrame and LLM
query_engine = PandasQueryEngine(df=df, llm=llm, verbose=True, synthesize_response=True)

class Query(BaseModel):
    query: str

@app.post("/query")
async def process_query(query: Query):
    try:
        # Create a QueryBundle with the instruction
        query_bundle = QueryBundle(query_str=query.query)

        # Execute the query using the PandasQueryEngine
        response = await query_engine._aquery(query_bundle)  # Use the async method

        return {"response": response.response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=7000)

Debugging Steps:

  1. Print DataFrame Columns: Ensure the column names are as expected.
  2. Explicit Column Reference: Modify your query to explicitly reference the column name if needed.

If the issue persists, double-check that both environments load the same CSV file, use the same LLM configuration, and run the same package versions.

By following these steps, you should be able to identify and resolve the discrepancies in query interpretation between Jupyter and the FastAPI environment.
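
To make the debugging steps above concrete, here is a minimal sketch (assuming the sample.csv from the question and the "Operation Name" column mentioned earlier in the thread):

import pandas as pd

df = pd.read_csv("sample.csv")

# Step 1: verify the exact column names (watch for case and stray spaces).
print(list(df.columns))

# Step 2: remove ambiguity before querying, e.g. by trimming headers and
# naming the exact column in the natural-language query.
df.columns = [c.strip() for c in df.columns]
query_text = 'What is the first value in the "Operation Name" column?'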

A maintainer has already been looped in.

To continue talking to Dosu, mention @dosu.