pydantic / pydantic

Data validation using Python type hints
https://docs.pydantic.dev
MIT License
20.75k stars 1.86k forks source link

Coercion from datetime at midnight to date is incompatible with JSON schema #7065

Closed davidhewitt closed 1 year ago

davidhewitt commented 1 year ago

Initial Checks

Description

In Pydantic V2 we allowed date to be constructed from a timestamp string if the timestamp is exactly at midnight.

In https://github.com/pydantic/pydantic/issues/7039#issuecomment-1671859377 @MarkusSintonen makes some compelling arguments why this was a mistake:

On a similar vein, we also coerce datetime to a date if the datetime is midnight.

I think we should consider classifying these coercions as a bug and reverting before it is too late.

If users want this behaviour, they can implement a "before" validator which does these coercions with Annotated.

Example Code

>>> from pydantic import TypeAdapter
>>> from datetime import date, datetime

>>> TypeAdapter(date).validate_python("2020-01-01T00:00:00")
datetime.date(2020, 1, 1)

>>> TypeAdapter(date).validate_python(datetime(2021, 1, 1))
datetime.date(2021, 1, 1)

Python, Pydantic & OS Version

pydantic version: 2.1.1
        pydantic-core version: 2.4.0
          pydantic-core build: profile=release pgo=true mimalloc=true
                 install path: /home/david/.virtualenvs/pydantic-prod/lib/python3.11/site-packages/pydantic
               python version: 3.11.4 (main, Jun  9 2023, 07:59:55) [GCC 12.3.0]
                     platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.37
     optional deps. installed: ['email-validator', 'typing-extensions']

Selected Assignee: @Kludex

davidhewitt commented 1 year ago

I spoke with @samuelcolvin about this briefly. Because removing this makes things stricter, which would be a breaking change, this is probably too late for V2. We should keep this issue open to consider if we want to revisit in V3.

In the meanwhile, users of V2 have a couple of solutions:

>>> TypeAdapter(Annotated[date, Strict]).validate_json(b'"2020-01-01T00:00:00"')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/david/dev/pydantic/pydantic/pydantic/type_adapter.py", line 223, in validate_json
    return self.validator.validate_json(__data, strict=strict, context=context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for date
  Input should be a valid date in the format YYYY-MM-DD, unexpected extra characters at the end of the input [type=date_parsing, input_value='2020-01-01T00:00:00', input_type=str]
    For further information visit https://errors.pydantic.dev/2.1/v/date_parsing

I'd be open to ideas how we make these solutions as ergonomic as possible.

MarkusSintonen commented 1 year ago

Honestly I think this behaviour in V2 is bad. Its losing information and its a big breaking change from V1 (I think its going to a worse direction in date handling). I would like to include to the list of problems also the loss of TZ information:

class A(BaseModel):
    a: date | datetime

# WHAT...
# datetime.date(2023, 1, 1)
print(repr(A.model_validate({"a": "2023-01-01T00:00:00+01:00"}).a))

Even if above bug with TZ losing would be fixed (to only coerce undefined TZs) it would be still a very surprising behaviour that it suddenly goes to a date instance.

samuelcolvin commented 1 year ago

I'm not sure we talking about the same thing.

In strict mode when validation python, any string regardless of format should be invalid as an input to both date and datetime. I understand there's a bug here and it definitely needs to be fixed.

In lax mode, I would argue that datetime's at midnight should be allowed as inputs to date - just as .0 floats are allowed as input to int fields - the idea being that no data has been lost.

With your above union example, the problem is that the string is currently being validated to a date in the first (strict) union pass - the above fix should solve this case.

Does this make sense?

MarkusSintonen commented 1 year ago

the above fix should solve this case.

json mode will reject timestamps in strict mode:

That does not work eg in FastAPI context.

users can define DateNoTS = Annotated[date, BeforeValidator(reject_timestamps)] for some suitable implementation of reject_timestamps

That feels a bit like bandaid work compared to how it used to work in V1. Its quite hard to remember to add that to basically all date usages to not allow (buggy) coercion to happen. Eg it means all FastAPI params require that to not allow users to accidentally pass timestamps (with TZs) in APIs. Otherwise it would always mean a weird API design all over the place with eg timezoned timestamps being converted to dates when it used to be eg validation error.

adriangb commented 1 year ago

In lax mode, I would argue that datetime's at midnight should be allowed as inputs to date - just as .0 floats are allowed as input to int fields - the idea being that no data has been lost.

I have to disagree with this, for various reasons. For starters a float is a subtype of an int so anything that accept a float generally also accepts an int, and if the float ended in .0 the mathematical outcome will be the same. The same thing cannot be said for date / datetime.

samuelcolvin commented 1 year ago

float is not a subtype of int.

The rule we're trying to stick to in V2 is "in lax mode, coerce if there's one obvious equivalent in the field type, and no data is lost".

The problem is that v1 (wrongly) allowed datetime as input's to date, although it didn't allow datetime formatted strings as inputs to date. All of which means it's not clear how to be somewhat compatible with V1 without equally inconsistent behaviour.

The other problem is that fastapi doesn't use validate_json, hence users are effectively forced to use lax mode with fastapi - there's no way to submit a valid input to a date field if you're effectively doing Model.model_validate(json.loads(data)).

If @MarkusSintonen didn't have this constraint, I think he would just use strict on his model or at least this field and the problem would be solved.


As a workaround we could add a config flag to prevent all datetimes as inputs to a date field, default off. That would fix the problem for users that want that, without changing current behaviour.

WDYT?

MarkusSintonen commented 1 year ago

just as .0 floats are allowed as input to int fields The rule we're trying to stick to in V2 is "in lax mode, coerce if there's one obvious equivalent in the field type, and no data is lost".

Comparing numeric value handling with datetime parsing rules doesnt make much sense to me. Why do you think its a good idea to convert ISO datetimes (at midnights) with TZs into date instances? V2s strange ISO datetime -> date coercion loses now information that you said "lax" mode avoids. This really feels like a bug rather than a good feature:

class A(BaseModel):
    a: date | datetime

# Fails with V2, works in V1
# AssertionError: assert datetime.date(2023, 1, 1) == datetime.datetime(2023, 1, 1, 0, 0, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200)))
assert A(a="2023-01-01T00:00:00+02:00").a == datetime(2023, 1, 1, tzinfo=timezone(timedelta(hours=2)))

The other problem is that fastapi doesn't use validate_json, hence users are effectively forced to use lax mode with fastapi As a workaround we could add a config flag to prevent all datetimes as inputs to a date field, default off. That would fix the problem for users that want that, without changing current behaviour.

I dont think that is a valid point or workaround as there is much more of parameter parsing in FastAPI that is not anything related to json. See trivial example below which is now broken in V2. Adding yet more parameters to configs doesnt feel right they are not really usable in many contexts. This is especially bad if the lossy conversion is on by default. It would be horrible usage ergonomics if you have to always remember to opt-out from the current behaviour all over the place.

def test_fastapi():
    async def handler(v: date):
        return {"v": v}

    app = FastAPI()
    app.router.add_api_route("/foo", handler)

    # API accepts date https://json-schema.org/understanding-json-schema/reference/string.html#dates-and-times
    assert app.openapi()["paths"]["/foo"]["get"]["parameters"][0]["schema"] == {"format": "date", "title": "V", "type": "string"}

    client = TestClient(app)
    res = client.get("/foo", params={"v": "2023-01-01T00:00:00"})  # Invalid value per OpenAPI spec
    # In Pydantic V2 this test fails as it gives an incorrect successful response...
    # AssertionError: Got: {'v': '2023-01-01'}
    # assert 200 == 422
    assert res.status_code == 422, f"Got: {res.json()}"

Not really an issue in FastAPI but in PydanticV2.

MarkusSintonen commented 1 year ago

I would like to add that every FastAPI app out there is now broken in validation with any date related APIs when using PydanticV2 because of this issue. It is really not acceptable to allow these kind of values for dates as its in clear violation of OpenAPI specification.

Ofcourse PydanticV2 violates OpenAPI and json schema specifictions for stringlike numerics coercion. Which should be rejected per those specifications. But its less of an problem compared to ISO datetime handling in PydanticV2 which feels now quite broken (eg with anything having TZs)

samuelcolvin commented 1 year ago

@MarkusSintonen I get where you're coming from and I do want to find the right solution - I agree what we have now is broken.

However I think you're being a bit hard line, let me explain.

Why do you think its a good idea to convert ISO datetimes (at midnights) with TZs into date instances?

I'm not saying anything specific about ISO timestamps. Consider the following two principles:

1

"if A is a valid input to B and B is a valid input to C, then A should be a valid input to C"

I'm sure we're not absolutely consistent about that principle, but that's one of the principles we're trying to aim for.

2

In lax mode: If a field is of type X, then subclasses of X should be allowed as input provided no information is lost by upcasting to an exact instance of X

Again, I'm sure we're not perfect about following this rule, but I do think in general it makes sense.

It's also consistent with Python typing if a function is decorated (x: X) then any instance of X (including instances of subclasses) would be allowed as input.

Effectively this is just saying the pydantic validation is Covariant (I think, I don't pretend to know very much about typing theory)


Putting all that together would imply that:

  1. datetime(2023, 1, 1) should be a valid input to a field of type date - it's an instance of date and has no extra information
  2. '2023-01-01T00:00:00' should be a valid input to a field of type date since it would be a valid input to a datetime field

Regarding parmeters and FastAPI, I would argue that they too should be validated in a similar way to JSON, we already have an implementation of the Input trait for String in pydantic-core - we should investigate whether it could be altered and exposted to be usable for any string input - e.g. URL arguments, GET arguments and data from multipart form bodies - basically any scenario where the input starts of as a string from.

MarkusSintonen commented 1 year ago

datetime(2023, 1, 1) should be a valid input to a field of type date - it's an instance of date and has no extra information

That is kinda true but need to be careful as also this is true in Python isinstance(datetime(2023, 1, 1, 1, 1), date) is True. So maybe this is fine A(f=datetime(2023, 1, 1)) where f: date

'2023-01-01T00:00:00' should be a valid input to a field of type date since it would be a valid input to a datetime field

That feels wrong as it is not correctly formatted ISO date. Also its not parseable by date.fromisoformat so its inconsistent how you work with ISO dates in Python. So this feels off A(f='2023-01-01T00:00:00') where f: date. Especially off if there is TZ part there. But it still feels off if it would work without TZ part.