langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Parsing a dictionary of lists problem #27453

Open snassimr opened 1 week ago

snassimr commented 1 week ago

Example Code

I want to get "a" as a key in ppp, but the code below (using Dict) fails:

import os
from typing import Dict, List
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini-2024-07-18", temperature=0.0)

class A(BaseModel):
    a_1: str
    a_2: str
    r: str

class B(BaseModel):
    b_1: str
    b_2: str
    r: str

class C(BaseModel):
    ccc: List[A]
    ppp: Dict[str, List[B]]

structured_llm = model.with_structured_output(C)

response = structured_llm.invoke(prompt)

Error Message and Stack Trace (if applicable)

ValidationError: 1 validation error for C
ppp
  Field required [type=missing, input_value={'ccc': [{'a_1': 'Price',...tant to Battery Life'}]}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/missing

Description

I have code that works:

import os
from typing import List
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini-2024-07-18", temperature=0.0)

class A(BaseModel):
    a_1: str
    a_2: str
    r: str

class B(BaseModel):
    a: str
    b_1: str
    b_2: str
    r: str

class C(BaseModel):
    ccc: List[A]
    ppp: List[B]

structured_llm = model.with_structured_output(C)

response = structured_llm.invoke(prompt)
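Since the working schema carries the key in B's `a` field, the flat `List[B]` can be folded into the desired `Dict[str, List[B]]` shape after the fact. A minimal sketch, using a stand-in dataclass instead of the pydantic model and a hypothetical `group_by_key` helper:

```python
from collections import defaultdict
from dataclasses import dataclass

# Stand-in for the pydantic model B above.
@dataclass
class B:
    a: str
    b_1: str
    b_2: str
    r: str

def group_by_key(items):
    """Fold a flat List[B] into the Dict[str, List[B]] shape, keyed on `a`."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item.a].append(item)
    return dict(grouped)

# Illustrative values only; the real items come from the model response.
items = [
    B("Price", "b1", "b2", "r"),
    B("Price", "b3", "b4", "r"),
    B("Battery Life", "b5", "b6", "r"),
]
ppp = group_by_key(items)
# ppp now has the keys "Price" and "Battery Life"
```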

System Info

System Information

OS: Linux
OS Version: #1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024
Python Version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]

Package Information

langchain_core: 0.2.41
langchain: 0.2.16
langchain_community: 0.2.17
langsmith: 0.1.136
langchain_openai: 0.1.21
langchain_text_splitters: 0.2.4

Optional packages not installed

langgraph langserve

Other Dependencies

aiohttp: 3.10.10
async-timeout: 4.0.3
dataclasses-json: 0.6.7
httpx: 0.27.2
jsonpatch: 1.33
numpy: 1.26.4
openai: 1.52.0
orjson: 3.10.7
packaging: 24.1
pydantic: 2.9.2
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
SQLAlchemy: 2.0.35
tenacity: 8.5.0
tiktoken: 0.8.0
typing-extensions: 4.12.2

eyurtsev commented 1 week ago

Just a quick glance -- but this does not appear to be a bug in langchain; it's an issue with the chat model failing to produce the correct output.

I'd suggest adding reference examples to the prompt to help the model produce the correct thing.

ethanglide commented 1 week ago

For some reason, it seems that every time the LLM is asked to generate some kind of dictionary, the dictionary is missing from the response.

Consider this simple code and some variations:

from typing import Dict, List
from pydantic import BaseModel
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini-2024-07-18", temperature=0.0)

class TestModel(BaseModel):
    # variations here

structured_llm = model.with_structured_output(TestModel)

response = structured_llm.invoke('prompt')
print(response)

When we define TestModel as follows:

output: int

Here is the output (I am also outputting the openai tool that gets bound to the model):

{'type': 'function', 'function': {'name': 'TestModel', 'description': '', 'parameters': {'properties': {'output': {'type': 'integer'}}, 'required': ['output'], 'type': 'object'}}}
output=5

Even if we define TestModel like this:

output: List

We get this:

{'type': 'function', 'function': {'name': 'TestModel', 'description': '', 'parameters': {'properties': {'output': {'items': {}, 'type': 'array'}}, 'required': ['output'], 'type': 'object'}}}
output=['What is your favorite book and why?', 'If you could travel anywhere in the world, where would you go and what would you do there?', 'What is a skill you would like to learn and why?', 'Describe a memorable experience you had in the past year.', 'If you could have dinner with any historical figure, who would it be and what would you ask them?']

But as soon as TestModel gets defined as so:

output: Dict

Then all of a sudden the model does not respond with anything!

{'type': 'function', 'function': {'name': 'TestModel', 'description': '', 'parameters': {'properties': {'output': {'type': 'object'}}, 'required': ['output'], 'type': 'object'}}}
pydantic_core._pydantic_core.ValidationError: 1 validation error for TestModel
output
  Field required [type=missing, input_value={}, input_type=dict]

If there is a Dict somewhere, alongside other keys, then those other keys will be included in the output but the dictionary will not:

output: Dict
output_2: int

Gives:

{'type': 'function', 'function': {'name': 'TestModel', 'description': '', 'parameters': {'properties': {'output': {'additionalProperties': {'type': 'integer'}, 'type': 'object'}, 'output_2': {'type': 'integer'}}, 'required': ['output', 'output_2'], 'type': 'object'}}}
pydantic_core._pydantic_core.ValidationError: 1 validation error for TestModel
output
  Field required [type=missing, input_value={'output_2': 5}, input_type=dict]

Why are these fields getting ignored? Is this an issue with the model or what?

snassimr commented 1 week ago

Actually, I found a format that works better for me. In any case, I can convert from that format to a Dict with one line of Python code. @ethanglide does it work with examples and not just a 'prompt' string? I don't have much experience providing examples for this case.

ethanglide commented 1 week ago

Unfortunately I am not able to get it to work with examples either, assuming I did those correctly.

from typing import Dict, List
from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate

model = ChatOpenAI(model="gpt-4o-mini-2024-07-18", temperature=0.0)

class TestModel(BaseModel):
    output: Dict[str, str]

structured_llm = model.with_structured_output(TestModel)

examples = [
    {
        "input": "What is the capital of France?",
        "output": '{"output": "Paris"}'
    },
    {
        "input": "What is the capital of Germany?",
        "output": '{"output": "Berlin"}'
    },
    {
        "input": "What is the capital of Italy?",
        "output": '{"output": "Rome"}'
    }
]

example_prompt = ChatPromptTemplate.from_messages(
    [
        ("human", "{input}"),
        ("ai", "{output}"),
    ]
)

few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)

final_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a geography expert."),
        few_shot_prompt,
        ("human", "{input}"),
    ]
)

chain = final_prompt | structured_llm

response = chain.invoke({"input": "What is the capital of Lithuania?"})
print(response)

This gives:

{'type': 'function', 'function': {'name': 'TestModel', 'description': '', 'parameters': {'properties': {'output': {'additionalProperties': {'type': 'string'}, 'type': 'object'}}, 'required': ['output'], 'type': 'object'}}}
pydantic_core._pydantic_core.ValidationError: 1 validation error for TestModel
output
  Field required [type=missing, input_value={}, input_type=dict]

Which stays consistent with the issues I had above.

Of course, there are ways around this, and we should ask ourselves whether having the model respond with arbitrarily structured dictionaries with arbitrary numbers of keys is something that should be done at all. But it really is strange that it works with Lists (the model will even respond with lists of arbitrary size containing arbitrary objects if you don't specify the element type) and not with Dicts.
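One such workaround is to model the mapping as a list of key/value entries, which gives the model a fully specified array schema instead of an open-ended `object`, and then rebuild the dict locally. A sketch; the `Entry` and `to_dict` names are illustrative, not part of any API:

```python
from typing import List
from pydantic import BaseModel

class Entry(BaseModel):
    key: str
    values: List[str]

class TestModel(BaseModel):
    # A list of entries stands in for Dict[str, List[str]].
    entries: List[Entry]

def to_dict(m: TestModel) -> dict:
    """Rebuild the intended Dict[str, List[str]] from the entry list."""
    return {e.key: e.values for e in m.entries}

# Illustrative values; in practice `m` would come from with_structured_output.
m = TestModel(entries=[Entry(key="capital", values=["Paris"])])
result = to_dict(m)
```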

ethanglide commented 1 week ago

@eyurtsev what do you think about the above?

eyurtsev commented 1 week ago

The examples should be AI messages with tool calls, not just content, since you're using the tool calling API. Check the how-to guides for tool calling (apologies, on 📱 right now).

Should look like

System, human, ai, tool, human, ai, tool

Or else squeeze the examples into the system prompt

ethanglide commented 1 week ago

Thank you for the guidance, I haven't quite been able to make it work but I'm sure it is possible.

Program:

from typing import Dict
from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage
from langchain_core.runnables import RunnablePassthrough

model = ChatOpenAI(model="gpt-4o-mini-2024-07-18", temperature=0.0)

class TestModel(BaseModel):
    output: Dict

structured_llm = model.with_structured_output(TestModel)

examples = [
    HumanMessage("What is the capital of France?"),
    AIMessage(
        '',
        name='geography_assistant',
        tool_calls=[
            {
                'name': 'TestModel',
                'args': {"capital": "Paris"},
                'id': '1'
            },
        ],
    ),
    ToolMessage('{"output": {"capital": "Paris"}}', tool_call_id='1'),
    HumanMessage("What is the capital of Germany?"),
    AIMessage(
        '',
        name='geography_assistant',
        tool_calls=[
            {
                'name': 'TestModel',
                'args': {"capital": "Berlin"},
                'id': '2'
            },
        ],
    ),
    ToolMessage('{"output": {"capital": "Berlin"}}', tool_call_id='2'),
    HumanMessage("What is the capital of Italy?"),
    AIMessage(
        '',
        name='geography_assistant',
        tool_calls=[
            {
                'name': 'TestModel',
                'args': {"capital": "Rome"},
                'id': '3'
            },
        ],
    ),
    ToolMessage('{"output": {"capital": "Rome"}}', tool_call_id='3'),
]

final_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a geography expert."),
        *examples,
        ("human", "{input}"),
    ]
)

chain = {'input': RunnablePassthrough()} | final_prompt | structured_llm

response = chain.invoke('What is the capital of Lithuania?')
print(response)

Output:

pydantic_core._pydantic_core.ValidationError: 1 validation error for TestModel
output
  Field required [type=missing, input_value={'capital': 'Vilnius'}, input_type=dict]

As you can see, dicts are now being passed to the tool, which is great. It's not quite what I need, but I'm sure that with enough toying around and a more real-world example I could make this work. The whole problem is just that you cannot simply put Dict as the type and call it a day; the model will not respond with arbitrary objects, it will try to pass plain strings instead. Up to you to determine whether that is a real issue; I doubt it would get in the way of anyone's development.