langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

ChatOpenAI isn't compatible with create_pandas_dataframe_agent #1931

Closed sundar7D0 closed 1 year ago

sundar7D0 commented 1 year ago

I tried creating a pandas dataframe agent (using create_pandas_dataframe_agent) with ChatOpenAI as the LLM, both the PromptLayer version and the plain ChatOpenAI, but langchain isn't able to parse the code in the LLM's output. If I manually modify a regular expression, it works, but it fails again in other cases (for example, when the code is a single line). Basically, the regex and parsing logic that extracts Action and Action Input for chat models needs to be changed. This probably means that many other existing chains using ChatOpenAI as the LLM are affected as well!

Code:

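  # Import paths as of the langchain release current at the time; haplo_insuf is a pandas DataFrame
  from langchain.agents import create_pandas_dataframe_agent
  from langchain.chat_models import PromptLayerChatOpenAI
  chat = PromptLayerChatOpenAI(pl_tags=["langchain"])
  df_agent = create_pandas_dataframe_agent(chat, haplo_insuf, verbose=True)
  df_agent.run("find the maximum number of employees working in any company based on the range. a-b means minimum a, maximum b!")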

Output:

Entering new AgentExecutor chain...
Thought: We need to extract the minimum and maximum number of employees from the "df" dataframe and find the maximum value.
Action: python_repl_ast
Action Input:```max_emp = df['# Employees'].str.split('-', expand=True).astype(int).max(axis=1)max_emp.max()```
Observation: invalid syntax (<unknown>, line 2)
Thought:There seems to be a syntax error in the code. Let me fix it.
Action: python_repl_ast
Action Input:```max_emp = df['# Employees'].str.split('-', expand=True).astype(int).max(axis=1)max_emp.max()```
Observation: invalid syntax (<unknown>, line 2)
Thought:...
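
The parsing step described in this post can be sketched roughly as follows. This is an illustration only, not LangChain's actual parser: the name `parse_action` and the regular expression are assumptions made for this example. It shows a pattern that lets the Action Input span several lines, so both single-line and multi-line code are captured.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern (not taken from LangChain): DOTALL lets the
# "Action Input" capture run across line breaks to the end of the text.
ACTION_RE = re.compile(
    r"Action\s*:\s*(?P<tool>[^\n]+?)\s*\n+\s*Action Input\s*:\s*(?P<tool_input>.*)",
    re.DOTALL,
)


def parse_action(llm_output: str) -> Optional[Tuple[str, str]]:
    """Return (tool, tool_input), or None when no action is present."""
    match = ACTION_RE.search(llm_output)
    if match is None:
        return None
    return match.group("tool").strip(), match.group("tool_input").strip()


single = "Thought: compute\nAction: Calculator\nAction Input: 2 + 2"
multi = "Action: python_repl_ast\nAction Input: a = 1\na + 1"
print(parse_action(single))
print(parse_action(multi))
```

Any backticks around the captured input are left untouched here; removing them is a separate step.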
rishibh7 commented 1 year ago

+1

djpecot commented 1 year ago

+1

MohammedAlhajji commented 1 year ago

I think this is actually a general problem with gpt-3.5-turbo; if you try gpt-4 instead, it will work.

gpt-3.5-turbo adds backticks to the action input, which causes the syntax error. I'm trying to fix this behavior, but I haven't managed to yet.
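
The effect of the backticks can be reproduced with the standard library alone. The snippet below is a small sketch, independent of LangChain, that checks whether the same code parses with and without a surrounding fence; the helper name `parses` and the sample code are made up for this example.

```python
import ast

FENCE = "`" * 3  # built programmatically so the fence does not end this block

bare_code = "x = [1, 2, 3]\nmax(x)"
wrapped_code = f"{FENCE}\n{bare_code}\n{FENCE}"


def parses(source: str) -> bool:
    """Return True if `source` is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


print(parses(bare_code))     # True: plain code parses
print(parses(wrapped_code))  # False: backticks are not valid Python syntax
```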

pchng commented 1 year ago

Based on my understanding, this affects more than just create_pandas_dataframe_agent: it could affect any agent or tool that uses PythonREPL without "sanitizing" the input passed to it. It seems that sanitization is left up to each caller of PythonREPL, and the sanitization logic is duplicated in at least a few places.

  1. As a potential counter-example, LLMMathChain does sanitize/strip backticks from the input. So in theory this should work fine with gpt-3.5-turbo/ChatOpenAI.
  2. PythonAstREPLTool does not appear to sanitize the input. This is what is used by create_pandas_dataframe_agent and is likely the source of the issue here.
  3. Note: PythonREPLTool, which is defined right above PythonAstREPLTool, does sanitize/strip out the triple backticks.
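
One way to avoid duplicating that logic in every caller is a single shared helper. The sketch below is hypothetical: `strip_code_fences` and `sanitized` are names invented for this example, not LangChain functions, and the tool is modeled as a plain string-to-string callable.

```python
import re
from typing import Callable

FENCE = "`" * 3  # built programmatically so the fence does not end this block


def strip_code_fences(text: str) -> str:
    """Remove one leading and one trailing triple-backtick fence, if present.

    A language tag (e.g. "python") is dropped only when it sits on the
    fence line itself, so single-line input keeps all of its code.
    """
    text = text.strip()
    text = re.sub(r"^`{3}(?:[A-Za-z0-9_+-]*[ \t]*\n)?", "", text)
    text = re.sub(r"\n?`{3}$", "", text)
    return text.strip()


def sanitized(tool: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a string-in, string-out tool so its input is cleaned in one place."""
    def wrapper(tool_input: str) -> str:
        return tool(strip_code_fences(tool_input))
    return wrapper


echo = sanitized(lambda code: code)  # stand-in for a real REPL tool
print(echo(FENCE + "python\nprint(1)\n" + FENCE))
print(echo(FENCE + "max_emp.max()" + FENCE))
```

Wrapping the tool rather than patching each call site keeps the cleanup rule in a single function, which is the duplication concern raised in the list above.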

(Note: my main motivation for wanting to use gpt-3.5-turbo over text-davinci-003 is its much lower price at this time.)

However, even LLMMathChain sometimes has issues with sanitizing. Here is one run I've seen fail, though strangely it doesn't always fail. (Using ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo'); note that the answer would have been wrong anyway.)

agent.run("Calculate the population of the USA if the entire country had a population density equivalent to Wyoming")

> Entering new AgentExecutor chain...
I should use a calculator for this question
Action: Calculator
Action Input: Wyoming population (578,759) and USA land area (3,796,742 square miles)
Observation: Answer: 7.621784677494547

Thought:That's the population density of Wyoming, not the total population of the USA. I should use a search engine to find the total population of Wyoming.
Action: Search
Action Input: "Wyoming population"
Observation: Wyoming is a state in the Mountain West subregion of the Western United States. It is bordered by Montana to the north and northwest, South Dakota and Nebraska to the east, Idaho to the west, Utah to the southwest, and Colorado to the south.
Thought:Oops, I need to refine my search query to find the population
Action: Search
Action Input: "population of wyoming"
Observation: Wyoming is a state in the Mountain West subregion of the Western United States. It is bordered by Montana to the north and northwest, South Dakota and Nebraska to the east, Idaho to the west, Utah to the southwest, and Colorado to the south.
Thought:That didn't work. I should try a different search query.
Action: Search
Action Input: "current population of wyoming"
Observation: Wyoming is a state in the Mountain West subregion of the Western United States. It is bordered by Montana to the north and northwest, South Dakota and Nebraska to the east, Idaho to the west, Utah to the southwest, and Colorado to the south.
Thought:This is strange, the search engine isn't giving me the information I need. Let me try a different one.
Action: Search
Action Input: "population of wyoming 2021"
Observation: Wyoming is a state in the Mountain West subregion of the Western United States. It is bordered by Montana to the north and northwest, South Dakota and Nebraska to the east, Idaho to the west, Utah to the southwest, and Colorado to the south.
Thought:That still didn't work. Maybe I should try a different source.
Action: Search
Action Input: "wyoming population census bureau"
Observation: Population (up 7.4% to 331.4 million). Race and ethnicity (White alone 61.6%; Black alone 12.4%; Hispanic 18.7%; Asian alone 6%; American Indian ...
Thought:Ok, now I have the population of Wyoming and the population density equivalent to Wyoming. I can use the calculator to find the total US population.
Action: Calculator
Action Input: Wyoming population (578,759) and population density equivalent (5.840 people per square mile) and USA land area (3,796,742 square miles)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-8b759719f91c> in <cell line: 4>()
      2 # Also sometimes outputs the WRONG answer; one time it just gave the incorrect answer of the USA population!
      3 # Correct answer should be: ~22,212,450
----> 4 agent.run("Calculate the population of the USA if the entire country had a population density equivalent to Wyoming")

12 frames
/usr/local/lib/python3.9/dist-packages/langchain/chains/llm_math/base.py in _process_llm_result(self, t)
     66             answer = "Answer: " + t.split("Answer:")[-1]
     67         else:
---> 68             raise ValueError(f"unknown format from LLM: {t}")
     69         return {self.output_key: answer}
     70 

ValueError: unknown format from LLM: What is Wyoming's land area in square miles?

    ```python
    wyoming_pop_density = 5.84
    wyoming_pop = 578759
    wyoming_area = wyoming_pop / wyoming_pop_density
    print(wyoming_area)


(NOTE: I had to indent the last few lines by 4 spaces to get the backticks to render properly; there was no indentation in the actual output.)
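
In the failing run above, the reply begins with a sentence of prose and only then opens a code fence, and the traceback shows `_process_llm_result` rejecting it as an unknown format. A more tolerant extraction step might look like the sketch below. It is not the library's implementation: `extract_code` and its pattern are assumptions for illustration. It searches for a fenced block anywhere in the reply and accepts a missing closing fence.

```python
import re
from typing import Optional

FENCE = "`" * 3  # built programmatically so the fence does not end this block

# Hypothetical pattern: an opening fence (optionally tagged "python") anywhere
# in the reply, then code up to a closing fence at the start of a line or,
# failing that, up to the end of the text.
CODE_BLOCK_RE = re.compile(
    r"`{3}(?:python)?[ \t]*\n(?P<code>.*?)(?:^[ \t]*`{3}|\Z)",
    re.DOTALL | re.MULTILINE,
)


def extract_code(reply: str) -> Optional[str]:
    """Return the first fenced code block in `reply`, or None if there is none."""
    match = CODE_BLOCK_RE.search(reply)
    if match is None:
        return None
    return match.group("code").strip()


reply = (
    "What is Wyoming's land area in square miles?\n\n"
    + FENCE + "python\n"
    + "wyoming_pop_density = 5.84\n"
    + "wyoming_pop = 578759\n"
    + "wyoming_area = wyoming_pop / wyoming_pop_density\n"
    + "print(wyoming_area)\n"
)
print(extract_code(reply))
```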
Architectshwet commented 1 year ago

@sundar7D0, did this issue get resolved? Is the chat model working with the pandas agent?

sundar7D0 commented 1 year ago

I think that with the fixes suggested here and in other issues to resolve the parsing bottleneck, the chat model now works with the pandas agent!

brayniacapp commented 1 year ago

> issues to resolve the parsing bottleneck

Hey @sundar7D0 ! Any chance you can point out the other issues? 🙏

sundar7D0 commented 1 year ago

Please check the #1358 thread.

mrothroc commented 1 year ago

I am still experiencing this on v0.0.163. I don't have access to gpt-4, unfortunately. I don't see a manual workaround here in the thread. Is the fix supposed to be in the repo already?

adamcolon commented 1 year ago

I solved this problem by using AgentType.OPENAI_FUNCTIONS: create_pandas_dataframe_agent(ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo'), df, verbose=True, agent_type=AgentType.OPENAI_FUNCTIONS)