openai / openai-python

The official Python library for the OpenAI API
https://pypi.org/project/openai/
Apache License 2.0

Apply more fixes for Pydantic schema incompatibilities with OpenAI structured outputs #1659

Open mcantrell opened 2 months ago

mcantrell commented 2 months ago

Confirm this is a feature request for the Python library and not the underlying OpenAI API.

Describe the feature or improvement you're requesting

I noticed that you guys are doing some manipulation of Pydantic's generated schema to ensure compatibility with the API's schema validation. I found a few more instances that can be addressed:

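Issues:

Optional fields with Pydantic defaults generate a 'default' field in the schema, which is not supported.
Date fields generate a format='date-time' field in the schema, which is not supported.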

The test case below builds on your to_strict_json_schema function and removes these problematic fields with the remove_property_from_schema function:

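import json
import logging
from datetime import datetime
from typing import List, Optional

from openai import OpenAI
# to_strict_json_schema lives in a private module of the openai package, so this import path may change between releases
from openai.lib._pydantic import to_strict_json_schema
from pydantic import BaseModel, Field

logger = logging.getLogger(__name__)

class Publisher(BaseModel):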
    name: str = Field(description="The name of the publisher")
    url: Optional[str] = Field(None, description="The URL of the publisher's website")
    class Config:
        json_schema_extra = {
            "additionalProperties": False
        }

class Article(BaseModel):
    title: str = Field(description="The title of the news article")
    published: Optional[datetime] = Field(None, description="The date the article was published. Use ISO 8601 to format this value.")
    publisher: Optional[Publisher] = Field(None, description="The publisher of the article")
    class Config:
        json_schema_extra = {
            "additionalProperties": False
        }

class NewsArticles(BaseModel):
    query: str = Field(description="The query used to search for news articles")
    articles: List[Article] = Field(description="The list of news articles returned by the query")
    class Config:
        json_schema_extra = {
            "additionalProperties": False
        }

def test_schema_compatible():
    client = OpenAI()

    # build on the internals that the openai client uses to clean up the pydantic schema for the openai API
    schema = to_strict_json_schema(NewsArticles)

    # optional fields with pydantic defaults generate an unsupported 'default' field in the schema
    remove_property_from_schema(schema, "default")
    # date fields generate a format='date-time' field in the schema which is not supported
    remove_property_from_schema(schema, "format")

    logger.info("Generated Schema: %s", json.dumps(schema, indent=2))
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": "What were the top headlines in the US for January 6th, 2021?",
            }
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "schema": schema,
                "name": "NewsArticles",
                "strict": True,
            }
        }
    )
    result = NewsArticles.model_validate_json(completion.choices[0].message.content)
    assert result is not None

def remove_property_from_schema(schema: dict, property_name: str) -> None:
    """Remove the `property_name` keyword from each field schema, in place."""
    if 'properties' in schema:
        for field in schema['properties'].values():
            # inline nested object: recurse into its own fields
            if 'properties' in field:
                remove_property_from_schema(field, property_name)
            # Optional[...] fields are emitted as anyOf: [<type>, null]
            if 'anyOf' in field:
                for any_of in field['anyOf']:
                    any_of.pop(property_name, None)
            field.pop(property_name, None)
    if '$defs' in schema:
        # nested models are emitted as shared definitions
        for definition in schema['$defs'].values():
            remove_property_from_schema(definition, property_name)
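
The helper above only visits properties, one level of anyOf, and $defs, which is enough for the models in this issue but would miss a keyword nested under an inline array's items, for example. A more general walk might look like the sketch below. The name strip_keyword and the lists of schema-bearing keywords are assumptions made for illustration and are not part of the openai library; the sketch also assumes the input is a well-formed JSON Schema dict.

```python
from typing import Any

# Keywords whose value maps user-chosen names to subschemas. A model field
# may itself be called "format" or "default", so only the values are visited.
_SCHEMA_MAPS = ("properties", "$defs", "definitions")
# Keywords whose value is a list of subschemas.
_SCHEMA_LISTS = ("anyOf", "allOf", "oneOf", "prefixItems")
# Keywords whose value is a single subschema (or a boolean).
_SCHEMA_SINGLE = ("items", "additionalProperties", "not")


def strip_keyword(schema: Any, keyword: str) -> None:
    """Remove `keyword` from `schema` and every nested subschema, in place."""
    if not isinstance(schema, dict):
        return  # e.g. "additionalProperties": False
    schema.pop(keyword, None)
    for key in _SCHEMA_MAPS:
        for sub in schema.get(key, {}).values():
            strip_keyword(sub, keyword)
    for key in _SCHEMA_LISTS:
        for sub in schema.get(key, []):
            strip_keyword(sub, keyword)
    for key in _SCHEMA_SINGLE:
        if key in schema:
            strip_keyword(schema[key], keyword)


# Hand-written schema for illustration; not generated by Pydantic.
schema = {
    "type": "object",
    "properties": {
        "format": {"type": "string", "default": "json"},  # a field named "format"
        "dates": {
            "type": "array",
            "items": {
                "anyOf": [
                    {"type": "string", "format": "date-time"},
                    {"type": "null"},
                ],
                "default": None,
            },
        },
    },
    "additionalProperties": False,
}

strip_keyword(schema, "default")
strip_keyword(schema, "format")
```

The field named "format" survives because names under properties are never treated as keywords; only the schema attached to each name is cleaned.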

Additional context

No response

micahstairs commented 2 months ago

@RobertCraigie Thanks for fixing one of the issues! Do you have an ETA on the fix for the "format" issue?

RobertCraigie commented 1 month ago

There are currently no plans to automatically remove "format": "date-time" as it breaks .parse()'s promise that it will either generate valid data or refuse to generate any data.

We're considering opt-in flags to remove certain features that the API doesn't support yet, but I don't have an ETA to share, unfortunately.
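
The concern about removing "format": "date-time" can be illustrated with a small sketch. Once the keyword is stripped, the schema sent to the API only requires a string, so a value such as "January 6th, 2021" would satisfy it and then fail when parsed back into a datetime field. Here datetime.fromisoformat from the standard library is used as a stand-in for Pydantic's datetime validation, which accepts a somewhat different set of inputs; the helper name is_iso_datetime is hypothetical.

```python
from datetime import datetime


def is_iso_datetime(value: str) -> bool:
    """Return True if `value` parses as an ISO 8601 datetime."""
    try:
        datetime.fromisoformat(value)
    except ValueError:
        return False
    return True


# Both values are valid against {"type": "string"}, but only the first
# would survive being parsed as a datetime.
candidates = ["2021-01-06T00:00:00", "January 6th, 2021"]
results = {value: is_iso_datetime(value) for value in candidates}
```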