pydantic / pydantic-core

Core validation logic for pydantic written in Rust
MIT License

Proposal for mapping Python types to CombinedValidator #1337

Open adriangb opened 2 weeks ago

adriangb commented 2 weeks ago

Currently, because of configs, it is not possible to map a type to a SchemaValidator or CoreSchema. Consider:

class Inner(BaseModel):
    x: dict[str, list[float]]

    model_config = ConfigDict(allow_inf_nan=False)

class Outer(BaseModel):
    inner: Inner
    x: dict[str, list[float]]

    model_config = ConfigDict(allow_inf_nan=True)

Because of this we can’t just have a type_cache: dict[type, CoreSchema] when we build a schema from types in Pydantic: although dict[str, list[float]] shows up in two places, the actual schema differs because of the model config.
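As a toy illustration of the caching problem (all names here, like build_schema and type_cache, are hypothetical, not pydantic-core's actual API):

```python
def build_schema(tp: str, config: dict) -> dict:
    """Toy schema builder: the config leaks into the schema it produces."""
    if tp == "float":
        return {"type": "float", "allow_inf_nan": config.get("allow_inf_nan", True)}
    return {"type": tp}

type_cache: dict[str, dict] = {}

def cached_build(tp: str, config: dict) -> dict:
    # BUG: the cache key ignores the config, so the first config seen wins.
    if tp not in type_cache:
        type_cache[tp] = build_schema(tp, config)
    return type_cache[tp]

inner = cached_build("float", {"allow_inf_nan": False})  # Inner's config
outer = cached_build("float", {"allow_inf_nan": True})   # Outer's config
assert inner == outer  # both got Inner's schema -- wrong for Outer
```

The cache returns Inner's allow_inf_nan=False schema for Outer's field, which is exactly why the schema must be derivable from the type alone before such a cache is sound.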

To get around this I propose that we establish a rule: “every validator must be derivable from the type and only the type”. That means that all of the things coming from configs have to live elsewhere. For this I propose we (1) move all config things to ValidationState and (2) introduce a validator that mutates ValidationState to set the current config.

For example, we'd have {'type': 'config', 'config': {'allow_inf_nan': True}, 'schema': {...}}. At runtime this would mutate the validation context, and FloatValidator would pull that configuration from the context instead of storing it on the struct. Now both CoreSchema::Float and FloatValidator are immutable with respect to the type, and thus we can map from a type to a CoreSchema or SchemaValidator.
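A minimal Python sketch of how such a config validator could work, assuming a ValidationState that carries the current config (ConfigValidator, FloatValidator, and the method signatures are all illustrative, not the real Rust types):

```python
class ValidationState:
    def __init__(self):
        self.config: dict = {}

class FloatValidator:
    # No config stored on the validator: it is derived from the type alone.
    def validate(self, value, state: ValidationState):
        v = float(value)
        if not state.config.get("allow_inf_nan", True):
            if v != v or v in (float("inf"), float("-inf")):  # NaN or +/-inf
                raise ValueError("inf/nan not allowed")
        return v

class ConfigValidator:
    """Runtime behavior of {'type': 'config', 'config': ..., 'schema': ...}:
    merge the config into the state, run the wrapped validator, restore."""
    def __init__(self, config: dict, inner):
        self.config, self.inner = config, inner

    def validate(self, value, state: ValidationState):
        previous = state.config
        state.config = {**previous, **self.config}
        try:
            return self.inner.validate(value, state)
        finally:
            state.config = previous
```

For example, ConfigValidator({'allow_inf_nan': False}, FloatValidator()) accepts '1.5' but rejects float('nan'), and the state's config is restored afterwards.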

I expect this will slightly simplify some code (it essentially merges runtime parameters and compile time parameters into one code path where compile time just means a compile time defined mutation of runtime parameters) and have a minimal performance impact (we can still convert all configs to rust structs ahead of time, it’s just a struct being passed in through context vs hardcoded on the validator).

I think we might as well also establish rules (or at least document current behavior) for config merging and for how configs interact with runtime parameters.
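For illustration only, one possible precedence (an assumption, not settled behavior): runtime parameters passed to validation override schema-level config, which overrides the defaults.

```python
# Hypothetical merge order: defaults < schema-level config < runtime params.
defaults = {"strict": False, "allow_inf_nan": True}
schema_config = {"allow_inf_nan": False}   # e.g. from model_config
runtime_params = {"strict": True}          # e.g. passed to validate_python

effective = {**defaults, **schema_config, **runtime_params}
assert effective == {"strict": True, "allow_inf_nan": False}
```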

@davidhewitt @sydney-runkle wdyt?

samuelcolvin commented 1 week ago

In principle, sounds good to me.

My concern is performance: some config options may result in significantly different validators being built. I've no idea whether those switches are widely used, but if they are, it could be a pain.

adriangb commented 1 week ago

Good point. I know we do that with float validators. Do you think that will have a big performance impact? I'm hoping we can solve startup performance and feature requests like a replace-types config with this sort of thing.

samuelcolvin commented 1 week ago

My guess is the impact will be small, especially if we do something like

if let Some(config) = opt_config {
    if let Some(first_thing) = config.first_thing {
        ...
    }
    if let Some(other_thing) = config.other_thing {
        ...
    }
}

Instead of just

if let Some(first_thing) = config.first_thing {
    ...
}
if let Some(other_thing) = config.other_thing {
    ...
}

So for the common case we only have one branch point. We could perhaps also make some validators generic, and thereby compile multiple configurations, then choose between them based on config?

adriangb commented 1 week ago

Yeah seems like something we can work to optimize if needed.

davidhewitt commented 1 week ago

Agreed on all of the above, I think this will make things like #993 much easier to reason about

sydney-runkle commented 1 week ago

This seems like a good idea to me.

I have some concerns about how complex our pydantic-core logic is when it comes to managing serialization preferences, especially around differentiating settings that are pushed down to nested models (like runtime flags) from config settings, for which we generally don't do this. For example, consider these structures:

So we have a SerializationState struct, which has a SerializationConfig field. That struct is constructed from individual settings passed to SerializationState, like inf_nan_mode. SerializationState also has an extra method that returns an Extra struct instance built from information on the SerializationState instance; that's where we find a lot of the config settings like by_alias, etc., but it also includes a reference to &self.config, that SerializationConfig instance.
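A rough Python rendering of the layering just described (the real structs live in pydantic-core's Rust serializer code; the field names here are simplified and not exhaustive):

```python
from dataclasses import dataclass, field

@dataclass
class SerializationConfig:
    # Built from individual settings passed to SerializationState.
    inf_nan_mode: str = "constants"

@dataclass
class Extra:
    by_alias: bool                 # a runtime-flag-style setting
    config: SerializationConfig    # borrows state.config in the Rust code

@dataclass
class SerializationState:
    config: SerializationConfig = field(default_factory=SerializationConfig)
    by_alias: bool = False

    def extra(self) -> Extra:
        # Runtime flags and config settings end up interleaved in Extra.
        return Extra(by_alias=self.by_alias, config=self.config)
```

The point is that per-call flags (by_alias) and config-derived settings (inf_nan_mode) travel through two different layers that both surface in Extra, which is part of the complexity being described.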

If you're not already confused, it gets worse...

All of this to say, I think that the changes you're proposing sound good, and we might be able to refactor some of the above in the process.

adriangb commented 1 week ago

I started an attempt in #1341. If we can keep this backwards compatible that would be a huge win. But probably still worth doing as much as reasonable in 1 PR just to fully understand the impact of the change.

adriangb commented 1 week ago

One thing to think about: this change means that anything with a config always pushes it down. E.g. right now you can have strict on a list but not on its items (I'm not sure whether that's possible via the public Python or Pydantic APIs, but it certainly is from the pydantic-core Rust API's point of view). This change would make that strict get pushed down to the items. Maybe this is fine, but I worry there are other cases where it's not.

One way around this would be to add a ResetConfig validator that resets the config to the default (or to some previous config?) so we can do ApplyConfig -> List -> ResetConfig(PreviousConfig) when we encounter {'type': 'list', 'strict': True, ...}.

I'm guessing we generate {'type': 'list', 'strict': True, ...} from Annotated[list[...], Field(strict=True)] or similar (as opposed to x: list[...] on a model with strict=True). So we need to make sure that with this change we can map Annotated[list[...], Field(strict=True)] -> ApplyConfig -> List -> Items and list[...] -> List -> Items for things to work nicely. But then the above proposal of inserting a ResetConfig between List and Items wouldn't really work. We could say strict is part of the type, but that seems dangerous conceptually.
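A Python sketch of the ApplyConfig -> List -> ResetConfig idea (all names hypothetical): ResetConfig restores a saved config so a list-level strict doesn't leak into item validation.

```python
class State:
    def __init__(self):
        self.config: dict = {}

class ApplyConfig:
    def __init__(self, config, inner):
        self.config, self.inner = config, inner
    def validate(self, value, state):
        saved = state.config
        state.config = {**saved, **self.config}  # merge config in
        try:
            return self.inner.validate(value, state)
        finally:
            state.config = saved

class ResetConfig:
    def __init__(self, config, inner):
        self.config, self.inner = config, inner
    def validate(self, value, state):
        saved = state.config
        state.config = self.config  # back to the pre-ApplyConfig config
        try:
            return self.inner.validate(value, state)
        finally:
            state.config = saved

class IntItem:
    def validate(self, value, state):
        if state.config.get("strict") and not isinstance(value, int):
            raise ValueError("strict: int required")
        return int(value)

class ListOf:
    def __init__(self, item):
        self.item = item
    def validate(self, value, state):
        if state.config.get("strict") and not isinstance(value, list):
            raise ValueError("strict: list required")
        return [self.item.validate(v, state) for v in value]
```

With this, ApplyConfig({'strict': True}, ListOf(ResetConfig({}, IntItem()))) validates strictly at the list level while still coercing '1' to 1 at the item level, which is the behavior the push-down would otherwise break.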

sydney-runkle commented 1 week ago

One thing to think about: this change means that anything with a config always pushes it down.

I'm concerned about making a change like this in a minor version - I fear this would be a breaking change for some folks...

adriangb commented 1 week ago

Any ideas then what we do about things that can have values derived from configs that don't get pushed down? Are those a separate case?

sydney-runkle commented 1 week ago

I'm a bit stumped here. Perhaps a good convo for our oss meeting tomorrow.

Just a concrete example for the current strict behavior:

from pydantic import BaseModel, ConfigDict

# note, strict defaults to false
class NotStrictModel(BaseModel):
    b: int

class StrictModel(BaseModel):
    a: int
    not_strict_model: NotStrictModel

    model_config = ConfigDict(strict=True)

# currently works, strict isn't pushed down to submodels
sm = StrictModel(**{'a': 1, 'not_strict_model': {'b': '2'}})
print(repr(sm))
#> StrictModel(a=1, not_strict_model=NotStrictModel(b=2))

adriangb commented 1 week ago

I think that example would be fine. Models / dataclasses always stop configs from being pushed down. They'd continue to do that by (1) always having a default config in Pydntic and thus (2) always inserting an ApplyConfig thing.