Open Sridevi17j opened 1 month ago
To resolve the issue where the placeholder [insert output from running the Pandas instruction]
is not being replaced with the actual output when querying the Pandas Query Engine via a FastAPI endpoint, you need to ensure that the PandasInstructionParser
function is correctly processing the output and that the PandasQueryEngine
is being used properly.
Here is a code snippet that demonstrates how to use the PandasQueryEngine
to parse and execute Pandas instructions:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pandas as pd
from llama_index.experimental.query_engine.pandas.output_parser import PandasInstructionParser
from llama_index.experimental.query_engine.pandas.pandas_query_engine import PandasQueryEngine
app = FastAPI()
# Load the CSV file
df = pd.read_csv("sample.csv")
# Initialize the PandasQueryEngine with the DataFrame
query_engine = PandasQueryEngine(df)
instruction_parser = PandasInstructionParser(df)
class Query(BaseModel):
query: str
@app.post("/query")
async def process_query(query: Query):
try:
# Create a QueryBundle with the instruction
query_bundle = QueryBundle(query_str=query.query)
# Execute the query using the PandasQueryEngine
response = query_engine._query(query_bundle)
# Parse the output using PandasInstructionParser
parsed_output = instruction_parser.parse(response.response)
return {"response": parsed_output}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=7000)
In this example, the FastAPI endpoint /query
accepts a POST request with a Pandas instruction as input. The PandasQueryEngine
is used to parse and execute the instruction, and the result is processed by the PandasInstructionParser
to ensure the placeholder is replaced with the actual output. Make sure that the query
string contains valid Pandas code that can be executed on the provided DataFrame df
[1][2][3].
To continue talking to Dosu, mention @dosu.
@dosu, we tried multiple ways, it does not work when it comes to API or a single python file. But the same code works in Jupyter notebook. I wonder how?
The discrepancy you're experiencing is likely due to the way the PandasQueryEngine
and PandasInstructionParser
are being used in different environments. Here are a few key points to consider:
Private Method Usage: Ensure you are not using private methods (those prefixed with an underscore) directly. Instead, use the public methods provided by the library.
Async Execution: If you are using asynchronous methods, ensure they are awaited properly.
Correct Imports: Make sure all necessary classes and functions are correctly imported.
Security Considerations: Be aware that using eval
in a production environment poses significant security risks.
Here is a revised version of your code that addresses these points:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pandas as pd
from llama_index.experimental.query_engine.pandas.output_parser import PandasInstructionParser
from llama_index.experimental.query_engine.pandas.pandas_query_engine import PandasQueryEngine
from llama_index.core.schema import QueryBundle # Ensure this import is correct
app = FastAPI()
# Load the CSV file
df = pd.read_csv("sample.csv")
# Initialize the PandasQueryEngine with the DataFrame
query_engine = PandasQueryEngine(df)
instruction_parser = PandasInstructionParser(df)
class Query(BaseModel):
query: str
@app.post("/query")
async def process_query(query: Query):
try:
# Create a QueryBundle with the instruction
query_bundle = QueryBundle(query_str=query.query)
# Execute the query using the PandasQueryEngine
response = await query_engine._aquery(query_bundle) # Use the async method
# Parse the output using PandasInstructionParser
parsed_output = instruction_parser.parse(response.response)
return {"response": parsed_output}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=7000)
Ensure that your Python version is 3.9 or higher, as the default_output_processor
function in PandasInstructionParser
relies on features available in Python 3.9 and above [1][2][3].
To continue talking to Dosu, mention @dosu.
@dosu, This is my question- "What was the first activity".
This question works on Jupyter, and gives the first activity name. Because it interprets the question well and search for the column name "Operation Name". But the same code when I execute in python API, it searched for the column named " Activity" and giving me the Key Error, which is resulting in " The first activity cannot be determined as there is no information provided in the query results."
Your given code also does not work in API... what is happening ?
The discrepancy where the Pandas Query Engine works in a Jupyter notebook but fails in a FastAPI endpoint could be due to differences in the environment or the way the code is executed. One common issue is the version of Python being used. The provided tests indicate that certain functionalities require Python 3.9 or higher.
Additionally, it appears that the PandasQueryEngine
has been deprecated and moved to llama-index-experimental
. This could be another reason for the failure in the FastAPI endpoint if the deprecated version is being used.
To resolve this issue, ensure that the FastAPI environment is using Python 3.9 or higher. You can check the Python version in your FastAPI environment and update it if necessary. Also, make sure to use the updated PandasQueryEngine
from llama-index-experimental
.
Here is a snippet to check the Python version in your FastAPI endpoint:
import sys
from fastapi import FastAPI
app = FastAPI()
@app.get("/check-python-version")
def check_python_version():
return {"python_version": sys.version}
If the version is lower than 3.9, you will need to update your Python environment to 3.9 or higher.
To use the updated PandasQueryEngine
, install the llama-index-experimental
package and import the engine from there:
pip install llama-index-experimental
Then, update your import statements in the FastAPI endpoint:
from llama_index.experimental.query_engine.pandas import PandasQueryEngine
By ensuring the correct Python version and using the updated PandasQueryEngine
, you should be able to resolve the discrepancy. Note that the PandasQueryEngine
provides the Agent access to the eval
function, which allows for arbitrary code execution. This tool is not recommended for use in a production setting without heavy sandboxing or virtual machines [1][2][3].
To continue talking to Dosu, mention @dosu.
@dosu, The environment is same for both Jupyter and FastAPIPython(Both uses same python version). What I understood is when executed in Jupyter it interprets the prompt to right query... like from activity to correct column name " Operation Name" .. But in Python file, "Activity" is being interpreted as activity itself and searches the column in the dataframe "Activity" and fails... If I give the right column name like "Operation Name" then it works in API python.. why so?
The difference in prompt interpretation between Jupyter and the FastAPI endpoint is likely due to the handling of the prompt formatting. In the SageMakerLLM
class, the complete
method formats the prompt using the _completion_to_prompt
function if the formatted
parameter is set to False
.
In Jupyter, you might be directly passing the prompt without additional formatting, while the FastAPI endpoint might be invoking the complete
method with formatted
set to False
, causing the prompt to be reformatted.
To resolve this, ensure that the formatted
parameter is consistently set to True
when calling the complete
method if you do not want the prompt to be reformatted. Here is the relevant part of the code:
@llm_completion_callback()
def complete(
self, prompt: str, formatted: bool = False, **kwargs: Any
) -> CompletionResponse:
model_kwargs = {**self.model_kwargs, **kwargs}
if not formatted:
prompt = self._completion_to_prompt(prompt, self.system_prompt)
request_body = self.content_handler.serialize_input(prompt, model_kwargs)
response = self._client.invoke_endpoint(
EndpointName=self.endpoint_name,
Body=request_body,
ContentType=self.content_handler.content_type,
Accept=self.content_handler.accept,
**self.endpoint_kwargs,
)
response["Body"] = self.content_handler.deserialize_output(response["Body"])
text = self.content_handler.remove_prefix(response["Body"], prompt)
return CompletionResponse(
text=text,
raw=response,
additional_kwargs={
"model_kwargs": model_kwargs,
"endpoint_kwargs": self.endpoint_kwargs,
},
)
Ensure that when you call the complete
method, you set formatted=True
if you want to avoid additional prompt formatting:
response = sage_maker_llm_instance.complete(prompt, formatted=True)
This should help maintain consistency in prompt interpretation between Jupyter and the FastAPI endpoint [1].
To continue talking to Dosu, mention @dosu.
@dosu, could you give me the complete code for the pandad query engine that should work on fast API, with the prompt formatting you just mentioned above?
Here is the complete code for the Pandas Query Engine that should work with FastAPI, including the correct prompt formatting:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pandas as pd
from llama_index.experimental.query_engine.pandas import PandasQueryEngine
from llama_index.experimental.query_engine.pandas.output_parser import PandasInstructionParser
from llama_index.core.schema import QueryBundle
from llama_index.legacy.prompts import PromptTemplate
app = FastAPI()
# Load the CSV file
df = pd.read_csv("sample.csv")
# Initialize PandasQueryEngine with the DataFrame
query_engine = PandasQueryEngine(df=df, verbose=True)
# Update prompts if needed
new_prompt = PromptTemplate(
"""\
You are working with a pandas dataframe in Python.
The name of the dataframe is `df`.
This is the result of `print(df.head())`:
{df_str}
Follow these instructions:
{instruction_str}
Query: {query_str}
Expression: """
)
query_engine._update_prompts({"pandas_prompt": new_prompt})
class Query(BaseModel):
query: str
@app.post("/query")
async def process_query(query: Query):
try:
# Create a QueryBundle with the instruction
query_bundle = QueryBundle(query_str=query.query)
# Execute the query using the PandasQueryEngine
response = await query_engine._aquery(query_bundle) # Use the async method
# Parse the output using PandasInstructionParser
instruction_parser = PandasInstructionParser(df)
parsed_output = instruction_parser.parse(response.response)
return {"response": parsed_output}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=7000)
This code sets up a FastAPI application that uses the PandasQueryEngine
to process queries on a DataFrame loaded from a CSV file. The query
endpoint takes a query string as input and returns the response after processing it with the PandasInstructionParser
. The prompt formatting is updated to ensure consistent interpretation of the query.
Make sure to install the necessary dependencies and use Python 3.9 or higher for compatibility.
To continue talking to Dosu, mention @dosu.
@dosu, your code give below error { "detail": "1 validation error for LLMPredictStartEvent\ntemplate\n Can't instantiate abstract class BasePromptTemplate without an implementation for abstract methods 'format', 'format_messages', 'get_template', 'partial_format' (type=type_error)" }
The error you're encountering indicates that the BasePromptTemplate
class is being instantiated directly, which is not allowed because it contains abstract methods that need to be implemented.
To resolve this, ensure that you are using the correct PromptTemplate
class from the llama_index.core.prompts
module. Here is the corrected code:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pandas as pd
from llama_index.experimental.query_engine.pandas import PandasQueryEngine
from llama_index.experimental.query_engine.pandas.output_parser import PandasInstructionParser
from llama_index.core.schema import QueryBundle
from llama_index.core.prompts import PromptTemplate, PromptType
app = FastAPI()
# Load the CSV file
df = pd.read_csv("sample.csv")
# Initialize PandasQueryEngine with the DataFrame
query_engine = PandasQueryEngine(df=df, verbose=True)
# Define the default template string
DEFAULT_PANDAS_TMPL = (
"You are working with a pandas dataframe in Python.\n"
"The name of the dataframe is `df`.\n"
"This is the result of `print(df.head())`:\n"
"{df_str}\n\n"
"Follow these instructions:\n"
"{instruction_str}\n"
"Query: {query_str}\n\n"
"Expression:"
)
# Create the PromptTemplate instance
DEFAULT_PANDAS_PROMPT = PromptTemplate(
DEFAULT_PANDAS_TMPL, prompt_type=PromptType.PANDAS
)
# Update the prompt template in the query engine
query_engine._update_prompts({"pandas_prompt": DEFAULT_PANDAS_PROMPT})
class Query(BaseModel):
query: str
@app.post("/query")
async def process_query(query: Query):
try:
# Create a QueryBundle with the instruction
query_bundle = QueryBundle(query_str=query.query)
# Execute the query using the PandasQueryEngine
response = await query_engine._aquery(query_bundle) # Use the async method
# Parse the output using PandasInstructionParser
instruction_parser = PandasInstructionParser(df)
parsed_output = instruction_parser.parse(response.response)
return {"response": parsed_output}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=7000)
This code ensures that the PromptTemplate
is correctly instantiated and used within the PandasQueryEngine
. Make sure to replace "sample.csv"
with the path to your actual CSV file. This should resolve the error related to the abstract class instantiation [1][2].
To continue talking to Dosu, mention @dosu.
@dosu, My prompt was- "what was the first activity"
and I get below error
Pandas Instructions:
df['Activity'].iloc[0]
Traceback (most recent call last): File "C:\Users\sride\PycharmProjects\Test AI.venv\Lib\site-packages\pandas\core\indexes\base.py", line 3805, in get_loc return self._engine.get_loc(casted_key) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc File "pandas\_libs\hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas\_libs\hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'Activity'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\sride\PycharmProjects\Test AI.venv\Lib\site-packages\llama_index\experimental\query_engine\pandas\output_parser.py", line 54, in default_output_processor
output_str = str(safe_eval(module_end_str, global_vars, local_vars))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sride\PycharmProjects\Test AI.venv\Lib\site-packages\llama_index\experimental\exec_utils.py", line 159, in safe_eval
return eval(__source, _get_restricted_globals(globals), locals)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "
Pandas Output: There was an error running the output as Python code. Error message: 'Activity' Traceback (most recent call last): File "C:\Users\sride\PycharmProjects\Test AI.venv\Lib\site-packages\llama_index\experimental\query_engine\pandas\output_parser.py", line 40, in default_output_processor tree = ast.parse(output) ^^^^^^^^^^^^^^^^^ File "C:\Users\sride\AppData\Local\Programs\Python\Python312\Lib\ast.py", line 52, in parse return compile(source, filename, mode, flags, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "
", line 1 There was an error running the output as Python code. Error message: 'Activity'
I don't know why the same code runs well with Jupyter without API,
The issue you're encountering is due to the difference in how the prompt is interpreted in Jupyter versus FastAPI. In Jupyter, the environment might be more interactive and flexible, allowing for better interpretation of the prompt. In contrast, FastAPI runs in a more controlled environment, which might lead to differences in how prompts are processed.
To ensure consistent prompt interpretation across both environments, you need to make sure that the prompt templates and the execution context are consistent. Here is an example of how you can set up the prompt templates and execution context in FastAPI to match the Jupyter environment:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pandas as pd
from llama_index.experimental.query_engine.pandas import PandasQueryEngine
from llama_index.core.prompts import PromptTemplate
from llama_index.core.schema import QueryBundle
app = FastAPI()
# Load the CSV file
df = pd.read_csv("sample.csv")
# Define the default instruction string and response synthesis prompt template
DEFAULT_INSTRUCTION_STR = (
"1. Convert the query to executable Python code using Pandas.\n"
"2. The final line of code should be a Python expression that can be called with the `eval()` function.\n"
"3. The code should represent a solution to the query.\n"
"4. PRINT ONLY THE EXPRESSION.\n"
"5. Do not quote the expression.\n"
)
DEFAULT_RESPONSE_SYNTHESIS_PROMPT_TMPL = (
"Given an input question, synthesize a response from the query results.\n"
"Query: {query_str}\n\n"
"Pandas Instructions (optional):\n{pandas_instructions}\n\n"
"Pandas Output: {pandas_output}\n\n"
"Response: "
)
DEFAULT_RESPONSE_SYNTHESIS_PROMPT = PromptTemplate(
DEFAULT_RESPONSE_SYNTHESIS_PROMPT_TMPL,
)
# Initialize the PandasQueryEngine with the same prompts and settings
query_engine = PandasQueryEngine(
df=df,
instruction_str=DEFAULT_INSTRUCTION_STR,
response_synthesis_prompt=DEFAULT_RESPONSE_SYNTHESIS_PROMPT,
verbose=True
)
class Query(BaseModel):
query: str
@app.post("/query")
async def process_query(query: Query):
try:
# Create a QueryBundle with the instruction
query_bundle = QueryBundle(query_str=query.query)
# Execute the query using the PandasQueryEngine
response = await query_engine._aquery(query_bundle) # Use the async method
return {"response": response.response}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=7000)
By ensuring that the prompt templates (DEFAULT_INSTRUCTION_STR
and DEFAULT_RESPONSE_SYNTHESIS_PROMPT
) and the execution context are the same in both Jupyter and FastAPI, you can achieve consistent prompt interpretation across both environments [1].
To continue talking to Dosu, mention @dosu.
@dosu, no luck, still the same error
@logan-markewich, it seems like this issue requires a deeper look, could you please lend your expertise?
To continue talking to Dosu, mention @dosu.
@logan-markewich , to explain the issue,
I'm encountering an issue with the pandas query engine behaving differently in various environments. Let me explain: The setup:
I have a simple pandas query engine code. My dataframe has a column named "Operation Name".
In Jupyter Notebook:
The code works perfectly. When I input the query "What was the first activity?", it correctly interprets "activity" as the "Operation Name" column. It searches the "Operation Name" column and returns the expected results.
However, in a .py file or when exposed as an API:
The same code doesn't work as expected. Using the same query "What was the first activity?", the query engine literally searches for an "Activity" column. Since there's no "Activity" column in my dataframe, it throws a KeyError. The error message is: "> Pandas Output: There was an error running the output as Python code. Error message: 'Activity'"
It seems the query engine's ability to map natural language to the correct column name is working in Jupyter, but failing in other environments. Do you have any insights on why this might be happening or how to resolve it?
Below is my code- I am using Azure OpenAI as an LLM( I am not giving LLM details here), but I have declared it in my code
`from llama_index.llms.azure_openai import AzureOpenAI from llama_index.experimental.query_engine.pandas.pandas_query_engine import PandasQueryEngine import pandas as pd
df= pd.read_csv("sample.csv")
query_engine = PandasQueryEngine(df=df,llm=llm,verbose=True,synthesize_response=True)
response = query_engine.query("What was the first activity") print(response.response)`
@dosu, @logan-markewich
Any help can be done on my issue pls....
A maintainer has already been looped in.
To continue talking to Dosu, mention @dosu.
A maintainer has already been looped in.
To continue talking to Dosu, mention @dosu.
@logan-markewich is it possible to help me on this?
@dosu, is it possible to get any update?
The issue you're experiencing, where the Pandas Query Engine works correctly in Jupyter but fails in a standalone Python script or API, is likely due to differences in how the environments handle the interpretation of natural language queries.
To address this, ensure that the column names used in your queries match exactly with the column names in your DataFrame. This includes checking for any leading or trailing spaces, case sensitivity, and special characters. Here is a step-by-step approach to debug and resolve the issue:
Here is an updated version of your code with additional debugging steps:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pandas as pd
from llama_index.experimental.query_engine.pandas import PandasQueryEngine
from llama_index.core.schema import QueryBundle
from llama_index.llms.azure_openai import AzureOpenAI
app = FastAPI()
# Load the CSV file
df = pd.read_csv("sample.csv")
# Print DataFrame columns for debugging
print("DataFrame columns:", df.columns)
# Initialize the LLM (ensure you have the correct details in your actual code)
llm = AzureOpenAI(api_key="your_api_key", endpoint="your_endpoint", api_version="2023-05-15")
# Initialize PandasQueryEngine with the DataFrame and LLM
query_engine = PandasQueryEngine(df=df, llm=llm, verbose=True, synthesize_response=True)
class Query(BaseModel):
query: str
@app.post("/query")
async def process_query(query: Query):
try:
# Create a QueryBundle with the instruction
query_bundle = QueryBundle(query_str=query.query)
# Execute the query using the PandasQueryEngine
response = await query_engine._aquery(query_bundle) # Use the async method
return {"response": response.response}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=7000)
If the issue persists, consider the following:
By following these steps, you should be able to identify and resolve the discrepancies in query interpretation between Jupyter and the FastAPI environment.
A maintainer has already been looped in.
To continue talking to Dosu, mention @dosu.
Question Validation
Question
Below is my code
from fastapi import FastAPI, HTTPException from pydantic import BaseModel import pandas as pd from IPython.display import Markdown, display import time
import pandas as pd from llama_index.llms.azure_openai import AzureOpenAI from llama_index.experimental.query_engine import PandasQueryEngin
Initialize the PandasQueryEngine
query_engine = PandasQueryEngine(df=df, llm=llm, verbose=True, synthesize_response=True)
class Query(BaseModel): query: str
app = FastAPI()
Load the CSV file
df = pd.read_csv("sample.csv")
@app.post("/query") async def process_query(query: Query): try: response = query_engine.query(query.query) return {"response": response.response} except Exception as e: raise HTTPException(status_code=500, detail=str(e))
if name == "main": import uvicorn uvicorn.run(app, host="0.0.0.0", port=7000)
"query": "Give me the number of users in the document" } { "response": "The number of unique users in the document is [insert output from running the Pandas instruction]." }