pydantic / pydantic

Data validation using Python type hints
https://docs.pydantic.dev
MIT License
20.76k stars 1.86k forks source link

Generic model with constrained type parameter generates incorrect JSON Schema #7308

Open sisp opened 1 year ago

sisp commented 1 year ago

Initial Checks

Description

The generated JSON Schema for a generic model with multiple attributes using the same type parameter is not correct. Consider the example code below which generates the following JSON Schema:

{
  "type": "object",
  "title": "M",
  "required": ["x", "y"],
  "properties": {
    "x": {
      "title": "X",
      "anyOf": [
        {"type": "integer"},
        {"type": "string"}
      ]
    },
    "y": {
      "title": "Y",
      "anyOf": [
        {"type": "integer"},
        {"type": "string"}
      ]
    }
  }
}

But since x and y use the same type parameter T, their types are coupled, so anyOf for each property is not correct. Instead, the schema should be this:

{
  "type": "object",
  "title": "M",
  "required": ["x", "y"],
  "anyOf": [
    {
      "properties": {
        "x": {
          "type": "integer",
          "title": "X"
        },
        "y": {
          "type": "integer",
          "title": "Y"
        }
      }
    },
    {
      "properties": {
        "x": {
          "type": "string",
          "title": "X"
        },
        "y": {
          "type": "string",
          "title": "Y"
        }
      }
    }
  ]
}

I realize that this may be difficult to generalize to more complex cases, and it's partially JSON Schema's fault because it doesn't have first-class support for type parameters which results in very verbose schemas. Nevertheless, I felt it's worth sharing this observation.

Example Code

from typing import Generic, TypeVar

from pydantic import BaseModel
from pydantic.types import StrictInt, StrictStr

T = TypeVar("T", StrictInt, StrictStr)

class M(BaseModel, Generic[T]):
    x: T
    y: T

M.model_json_schema()

Python, Pydantic & OS Version

pydantic version: 2.3.0
        pydantic-core version: 2.6.3
          pydantic-core build: profile=release pgo=true
                 install path: .../.venv/lib/python3.9/site-packages/pydantic
               python version: 3.9.13 (main, Sep  2 2022, 22:36:50)  [GCC 9.4.0]
                     platform: Linux-5.13.0-48-generic-x86_64-with-glibc2.31
     optional deps. installed: ['typing-extensions']

Selected Assignee: @samuelcolvin

dmontagu commented 1 year ago

So, right now, when we get unparametrized typevars we handle it in pydantic core by treating a typevar with constraints as the union of the constraints, but as you have noted, that is not actually what is semantically meant.

I'll note that this actually isn't just an issue with the JSON schema — it's an issue with the actual validation:

from typing import Generic, TypeVar

from pydantic import BaseModel
from pydantic.types import StrictInt, StrictStr

T = TypeVar("T", StrictInt, StrictStr)

class M(BaseModel, Generic[T]):
    x: T
    y: T

print(M(x=1, y='a'))
#> x=1 y='a'

(No validation error, even though this is a type error.)

I think supporting this properly would involve some changes to pydantic core to actually support this validation. In particular, you have to have an understanding of the scope of the typevar, which I think we currently don't track in pydantic-core schemas. But if you do know the scope of the typevar, then you can make this work by converting a type with that typevar into a union over all of its constraints, and if we could do that, we'd get a proper JSON schema "for free".

It seems to me that it would be difficult to make it work, but I'm not totally sure. At the very least, I think we could make it work for TypeAliasType with a typevar with constraints. Specifically, if we changed this line: https://github.com/pydantic/pydantic/blob/2acf1aff55a56f5cfe41f286350399007e7e425a/pydantic/_internal/_generate_schema.py#L988 to do something a bit different if the typevars have constraints vs. bounds, I think it shouldn't be too much work. Longer term, there might be a code migration path for pydantic where we try to automatically wrap generic types in a type alias type to achieve the same behavior, but I'm perhaps getting a bit ahead of myself here.

I'd be open to a PR to make this work for TypeAliasType following the suggestion above, or could get to it myself eventually.


In the more realistic short term, depending on your usage, there might be some workarounds. One thing you could do is:

ConcreteM = Union[M[StrictInt] | M[StrictStr]]

which should at least result in the right JSON schema and even field validation, but is no longer actually a model, and won't automatically match all the constraints on the typevar

Another tricky thing that will generate the right JSON schema for the model (though won't validate according to it, sadly), is:

from typing import Generic, TypeVar, Union

from pydantic_core import CoreSchema

from pydantic import BaseModel, TypeAdapter, GetJsonSchemaHandler
from pydantic.json_schema import JsonSchemaValue
from pydantic.types import StrictInt, StrictStr

T = TypeVar("T", StrictInt, StrictStr)

class M(BaseModel, Generic[T]):
    x: T
    y: T

    @classmethod
    def __get_pydantic_json_schema__(
        cls,
        __core_schema: CoreSchema,
        __handler: GetJsonSchemaHandler,
    ) -> JsonSchemaValue:
        if cls.__mro__[0] == M:  # this prevents the infinite recursion 
            type_ = Union[tuple(M[x] for x in T.__constraints__)]
            __core_schema = TypeAdapter(type_).core_schema
        return __handler(__core_schema)

print(M.model_json_schema())
"""
{'$defs': {'M_Annotated_int__Strict_strict_True___': {'properties': {'x': {'title': 'X', 'type': 'integer'}, 'y': {'title': 'Y', 'type': 'integer'}}, 'required': ['x', 'y'], 'title': 'M[Annotated[int, Strict(strict=True)]]', 'type': 'object'}, 'M_Annotated_str__Strict_strict_True___': {'properties': {'x': {'title': 'X', 'type': 'string'}, 'y': {'title': 'Y', 'type': 'string'}}, 'required': ['x', 'y'], 'title': 'M[Annotated[str, Strict(strict=True)]]', 'type': 'object'}}, 'anyOf': [{'$ref': '#/$defs/M_Annotated_int__Strict_strict_True___'}, {'$ref': '#/$defs/M_Annotated_str__Strict_strict_True___'}], 'title': 'M'}
""""

which I think is essentially the JSON schema you want, at least ignoring the definitions. (Not sure if that's a problem; if so, maybe we could add better support for automatically in-lining definitions where possible.) That said, I hope we can eventually come up with a better pattern for this or support it properly even from pydantic-core.

sisp commented 1 year ago

Thanks for your prompt and detailed reply, @dmontagu! :bow: I hadn't noticed that the validation is also affected, which is much worse than the JSON Schema generation.

I'm not so familiar with Pydantic's internals and they appear quite non-trivial, it might be a steep learning curve for me to figure out how to make the right changes. Not sure if this makes sense TBH. :confused: