instructor-ai / instructor

structured outputs for llms
https://python.useinstructor.com/
MIT License

Validation Error on nested models with Partial Streaming #587

Closed pmalacho-mit closed 7 months ago

pmalacho-mit commented 7 months ago

What Model are you using?

Describe the bug

The use of Partial does not seem to prevent validation errors on nested models.

If you have the following:

class Child(BaseModel):
   dummy: str

class Parent(BaseModel):
   child: Child

And then attempt partial streaming (with Parent as your response model), a validation error will be thrown on the child property.

To Reproduce

If you take the demo code offered in the Streaming Partial Responses documentation, and change the user property to be a single User instead of a list of Users, for example:

class MeetingInfo(BaseModel):
    # users: List[User]
    users: User
    date: str
    location: str
    budget: int
    deadline: str

A validation error will be thrown on the users property:

pydantic_core._pydantic_core.ValidationError: 1 validation error for PartialMeetingInfo
users
  Input should be a valid dictionary or instance of PartialUser [type=model_type, input_value=[], input_type=list]
    For further information visit https://errors.pydantic.dev/2.7/v/model_type

Expected behavior

No validation error on use of nested models (apologies if that's an incorrect assumption).

Screenshots N/A

dosubot[bot] commented 7 months ago

The issue you're experiencing with validation errors when using Partial with nested models in your MeetingInfo example seems to stem from how the model's fields are being handled for partial validation, especially when changing from a list of models to a single model instance. When you modify the MeetingInfo model to use a single User instead of a list, it's crucial to ensure that the User model is correctly wrapped with Partial to allow for partial validation, and that the field is marked as optional.

Here's how you might adjust your model definition to address this issue:

from pydantic import BaseModel
from typing import Optional
from instructor import Partial  # needed for Partial[User] below

class User(BaseModel):
    name: str
    email: str

class MeetingInfo(BaseModel):
    topic: str
    participant: Optional[Partial[User]]  # Assuming Partial is correctly implemented

In this adjusted example, participant is a single User model that is made optional and is correctly wrapped with Partial, facilitating partial validation. This adjustment ensures that the single User field is treated correctly for partial streaming scenarios.

For further implementation details, consider reviewing the Partial class and the _make_field_optional function within the instructor codebase, as these are key to handling partial streaming with nested models effectively.

#### About Dosu This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
pmalacho-mit commented 7 months ago

Sorry @dosubot, I should've mentioned in my original post that wrapping the field in Partial actually does not fix the problem -- only marking the property as optional fixes it.

I see here from this comment that it's actually an explicit decision to not mark those fields as optional though so that the AI will treat them as required (or at least so I assume, neat!): https://github.com/jxnl/instructor/blob/main/instructor/dsl/partial.py#L235

dosubot[bot] commented 7 months ago

It appears the core of the issue you're facing with partial streaming and validation errors in nested Pydantic models is tied to the handling of optional fields and the Partial wrapper's behavior. Despite wrapping the nested model with Partial, marking the property as optional seems to be the only effective solution you've found. This aligns with the explicit decision in the instructor codebase to not automatically mark these fields as optional to ensure they are treated as required by the AI.

To address this, you might consider explicitly marking nested model fields as optional in your BaseModel definitions, even when using Partial. This approach allows you to maintain the required validation behavior while also accommodating the partial data structures that are characteristic of streaming scenarios.

Here's a revised approach based on your findings and the instructor's implementation strategy:

from pydantic import BaseModel
from typing import Optional
from instructor.dsl.partial import Partial

class User(BaseModel):
    name: str
    email: str

class MeetingInfo(BaseModel):
    users: Optional[Partial[User]]
    date: str
    location: str
    budget: int
    deadline: str

In this example, users is explicitly marked as Optional and wrapped with Partial[User], which should help mitigate the validation errors you're encountering during partial streaming.

This method leverages both the flexibility of Partial for handling incomplete data and the explicit control over field optionality, providing a more robust solution for streaming scenarios. If further adjustments or clarifications are needed, feel free to reach out.

jxnl commented 7 months ago

can you try doing it with

client = instructor.from_openai(OpenAI())

client.chat.completions.create_with_partial

pmalacho-mit commented 7 months ago

Thanks for the quick reply, @jxnl (and thanks for this awesome library).

Hm, that seems to result in the same error. One point of confusion: did you mean create_partial (so without the _with_)?

If so, both of these still throw a validation error:

....

class MeetingInfo(BaseModel):
    users: User
    date: str
    location: str
    budget: int
    deadline: str

stream1 = client.chat.completions.create_partial(
    model="gpt-4",
    response_model=MeetingInfo,
    messages=[
        {
            "role": "user",
            "content": f"Get the information about the meeting and the users {text_block}",
        },
    ],
    stream=True,
)  # type: ignore

stream2 = client.chat.completions.create_partial(
    model="gpt-4",
    response_model=instructor.Partial[MeetingInfo],
    messages=[
        {
            "role": "user",
            "content": f"Get the information about the meeting and the users {text_block}",
        },
    ],
    stream=True,
)  # type: ignore
jxnl commented 7 months ago

Yes sorry

jxnl commented 7 months ago

I'll try to look at this but quite busy right now. I'd try to make everything optional for now.

jxnl commented 7 months ago
from pydantic import BaseModel

from openai import OpenAI
import instructor

client = OpenAI()

client = instructor.from_openai(client)

class User(BaseModel):
    name: str
    email: str

class MeetingInfo(BaseModel):
    user: User
    date: str
    location: str
    budget: int
    deadline: str

data = """
Jason Liu jason@gmail.com
Meeting Date: 2024-01-01
Meeting Location: 1234 Main St
Meeting Budget: $1000
Meeting Deadline: 2024-01-31
"""
stream1 = client.chat.completions.create_partial(
    model="gpt-4",
    response_model=MeetingInfo,
    messages=[
        {
            "role": "user",
            "content": f"Get the information about the meeting and the users {data}",
        },
    ],
    stream=True,
)  # type: ignore

for message in stream1:
    print(message)
"""
user={} date=None location=None budget=None deadline=None
user={} date=None location=None budget=None deadline=None
user={} date=None location=None budget=None deadline=None
user={} date=None location=None budget=None deadline=None
user=PartialUser(name=None, email=None) date=None location=None budget=None deadline=None
user=PartialUser(name=None, email=None) date=None location=None budget=None deadline=None
user=PartialUser(name=None, email=None) date=None location=None budget=None deadline=None
user=PartialUser(name=None, email=None) date=None location=None budget=None deadline=None
user=PartialUser(name=None, email=None) date=None location=None budget=None deadline=None
user=PartialUser(name=None, email=None) date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email=None) date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email=None) date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email=None) date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email=None) date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email=None) date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email=None) date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email=None) date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email=None) date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email=None) date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date=None location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location=None budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=None deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=100 deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=1000 deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=1000 deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=1000 deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=1000 deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=1000 deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=1000 deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=1000 deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=1000 deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=1000 deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=1000 deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=1000 deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=1000 deadline=None
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=1000 deadline='2024-01-31'
user=PartialUser(name='Jason Liu', email='jason@gmail.com') date='2024-01-01' location='1234 Main St' budget=1000 deadline='2024-01-31'
"""

this works.
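
The step from budget=100 to budget=1000 in the output above suggests that each snapshot is parsed from whatever JSON text has arrived so far. Below is a minimal, standard-library-only sketch of that idea. It is not instructor's actual implementation, and the helper names close_partial_json and snapshots are hypothetical.

```python
import json
from typing import Iterable, Iterator


def close_partial_json(fragment: str) -> str:
    """Append the closers a truncated JSON fragment is missing (hypothetical helper)."""
    closers = []  # stack of pending "}" / "]" characters
    in_string = False
    escaped = False
    for ch in fragment:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()
    return fragment + ('"' if in_string else "") + "".join(reversed(closers))


def snapshots(chunks: Iterable[str]) -> Iterator[dict]:
    """Yield a parsed snapshot after each chunk whose accumulated text can be repaired."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        try:
            yield json.loads(close_partial_json(buffer))
        except json.JSONDecodeError:
            continue  # e.g. a key with no value yet; wait for more text


# Made-up chunks that split the budget value mid-number.
chunks = ['{"user": {"name": "Jason Liu"', '}, "budget": 100', '0}']
for snap in snapshots(chunks):
    print(snap)
```

With these chunks, the second snapshot reports a budget of 100 and the third reports 1000, mirroring the intermediate value seen in the real stream.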

pmalacho-mit commented 7 months ago

@jxnl thanks, I think I know what's actually going on here (and it's mostly 'user error'): keeping the property name as users, combined with a prompt that mentions multiple users, seems to make the LLM want to emit a list for that entry. Interestingly, your updated prompt with only a single user specified works even when the property is named users (I guess the LLM can intuit that the property is just named poorly).

This is consistent with the original error, which, re-reading, makes it clear enough that the model was trying to stuff a [] where it didn't belong: Input should be a valid dictionary or instance of PartialUser [type=model_type, input_value=[], input_type=list].
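
A quick way to confirm this kind of diagnosis is to compare the top-level shapes in the raw JSON against what the schema expects, before any Pydantic validation runs. The sketch below uses only the standard library. EXPECTED_SHAPES and shape_mismatches are hypothetical names, and the expected types are written by hand here rather than derived from the MeetingInfo model.

```python
import json

# Hand-written expectations for MeetingInfo's top-level fields (an assumption
# for illustration; not derived from the Pydantic model).
EXPECTED_SHAPES = {
    "users": dict,  # a single nested User, so a JSON object is expected
    "date": str,
    "location": str,
    "budget": int,
    "deadline": str,
}


def shape_mismatches(raw_json: str) -> list[str]:
    """Report fields whose JSON type differs from the expected one."""
    data = json.loads(raw_json)
    problems = []
    for field, expected in EXPECTED_SHAPES.items():
        value = data.get(field)
        # Missing or null fields are skipped: partial output may not have them yet.
        if value is not None and not isinstance(value, expected):
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


print(shape_mismatches('{"users": [], "date": "2024-01-01"}'))
# prints ['users: expected dict, got list']
```

A mismatch like this points at the prompt or the field name rather than at Partial itself.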

I independently ran into this issue on my own code, so I assume I must've done something similar (named something as a singular when expecting a list, or vice versa).

Some good LLM learning! Thanks for working with me to debug.

jxnl commented 7 months ago

I hate llms, probably true lol

jxnl commented 7 months ago

it's almost too smart, but agree it feels like a funny face of user error?? i deleted the s without thinking