unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Validate schema against another schema #963

Open hsorsky opened 2 years ago

hsorsky commented 2 years ago

Is your feature request related to a problem? Please describe.
Provide a way for a schema to validate against another schema (e.g. to check that they're compatible). An example use case would be to check that two transformers can validly be applied one after the other (i.e. that the output schema of transformer n is compatible with the input schema of transformer n + 1).

Describe the solution you'd like
Similar to how we can do schema.validate(df), it would be nice to be able to do schema_1.validate(schema_2).

Describe alternatives you've considered
A hacky way to get this done would be to use schema_2.example to generate fake data that is then fed into schema_1.validate, i.e. schema_1.validate(schema_2.example(0)), but this feels very sub-optimal.
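Concretely, something like this (the schemas here are purely illustrative):

import pandera as pa

# illustrative schemas
schema_1 = pa.DataFrameSchema({"x": pa.Column(int), "y": pa.Column(str)})
schema_2 = pa.DataFrameSchema({"x": pa.Column(int, pa.Check.in_range(0, 10)), "y": pa.Column(str)})

# generate an empty dataframe that conforms to schema_2, then validate it against schema_1
# (requires pandera's hypothesis-backed strategies extra)
schema_1.validate(schema_2.example(0))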

cosmicBboy commented 2 years ago

Hi @hsorsky the solution you describe with schema_1.validate(schema_2.example(0)) is actually the way I'd recommend doing it.

Doing extra magic like schema_1.validate(schema_2) seems confusing to me... it looks like I'm trying to validate schema_2 itself instead of the data that it generates.

Are you using this in the context of unit tests? If so I'd recommend using the hypothesis library instead of .example(): https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html#usage-in-unit-tests.
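Roughly, that looks something like this (the schema and function under test here are hypothetical):

import pandera as pa
from hypothesis import given

schema = pa.DataFrameSchema({"x": pa.Column(int, pa.Check.in_range(0, 100))})

def process(df):
    # hypothetical function under test
    return df.assign(y=df["x"] * 2)

# hypothesis generates dataframes that conform to `schema`
@given(schema.strategy(size=5))
def test_process(df):
    assert (process(df)["y"] <= 200).all()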

An example use case would be to check that two transformers can validly be applied one after the other

By transformer do you mean a function?

Let me know if I'm missing something, sometimes example code helps me to grok the intent behind the feature request.

hsorsky commented 2 years ago

My specific use case is in building machine learning pipelines that consist of dataframe transformers (i.e. dataframe_transformed = transformer.transform(dataframe)). Due to the compute-intensive nature of these pipelines, we want to be able to validate that all of the components "fit together" and can be used one after the other before we actually run any data through them. Our solution was to attach schemas to each transformer defining:

  1. A schema that data coming into the transformer should conform to
  2. A way to transform an incoming schema into an outgoing schema
  3. A schema that data coming out of the transformer should conform to

and then check that all of the schemas fit together.

When it comes to validating a whole pipeline, we start with a schema defining the structure of the data that will be fed into the pipeline, and then run it through the validation, checking that each component fits with the next. As part of this, one needs to check that the output of step i is valid input to step i + 1. The most natural way to do this, IMO, would be via something like schema_1.validate(schema_2), i.e. checking that data conforming to schema_2 would conform to schema_1. As pseudocode this might look like:

schema = ...  # some schema that we derive from data, or get as output from some data fetching pipeline, etc.
for transformer in transformers:
    transformer.input_schema.validate(schema)
    schema = transformer.transform_schema(schema)
    transformer.output_schema.validate(schema)

The issue I can see with relying on schema_1.validate(schema_2.example(0)) is that schema_2.example(0) would generate a DataFrame that loses a lot of context, e.g. which columns are nullable and which aren't. I think that by checking schemas against one another, you can ensure that such context agrees between the schemas.
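For instance (illustrative):

import pandera as pa

schema_1 = pa.DataFrameSchema({"x": pa.Column(float, nullable=False)})
schema_2 = pa.DataFrameSchema({"x": pa.Column(float, nullable=True)})

# passes, because a zero-row dataframe trivially contains no nulls...
schema_1.validate(schema_2.example(0))
# ...even though data that schema_2 considers valid (nulls in "x") would fail schema_1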

cosmicBboy commented 2 years ago

Cool, this use case makes sense, though I have an additional question:

In the code example above, how is the output of transformer.transform_schema(schema) different from transformer.output_schema? Are they meant to be the same? Is the intent that transform_schema implements schema transformations that are validated with output_schema.validate(schema)?

re: the concerns with schema_2.example(0), would you be able to generate a dataframe with size > 0 so that it contains some of the context?

Semantics of schema.validate(other_schema)

I think I'm still missing something. Suppose we implement schema.validate(other_schema)... it's unclear to me what this operation would actually be doing under the hood besides schema.validate(other_schema.example(0)).

Are you thinking that this would actually analyze the columns, types, and constraints of other_schema and return True if other_schema is a "subtype" of schema? (by "subtype" I mean all data generated by other_schema is valid under constraints in schema)

hsorsky commented 2 years ago

In the code example above, how is the output of transformer.transform_schema(schema) different from transformer.output_schema? Are they meant to be the same? Is the intent that transform_schema implements schema transformations that are validated with output_schema.validate(schema)?

Apologies, I wasn't very clear in my example. I'm imagining input_schema and output_schema to be attributes/properties of type pa.DataFrameSchema. transform_schema, however, would be a method with a signature like

def transform_schema(self, schema: pa.DataFrameSchema) -> pa.DataFrameSchema:

that "transforms" the schema itself.

The need for this would be, for example, if you had a transformer that appends a column to a DataFrame. In this case, the output_schema would just be something like

pa.DataFrameSchema({new_col: pa.Column()}, strict=False)

But what if the DataFrame that's expected to come into the transformer in a given instance also has some other columns, e.g. has the schema

pa.DataFrameSchema({some_other_col: pa.Column()}, strict=True)

then upon being transformed, the DataFrame would have both columns, and so the transformed schema should reflect that, i.e. be

pa.DataFrameSchema({some_other_col: pa.Column(), new_col: pa.Column()}, strict=True)

i.e. it should reflect the output schema PLUS anything that is passed through from input.
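As a rough sketch (the class name is made up, and this assumes DataFrameSchema.add_columns for the schema transformation):

import pandera as pa

class AppendColumnTransformer:
    """Illustrative transformer that appends a single column to a dataframe."""

    def __init__(self, new_col: str):
        self.new_col = new_col
        # describes only what this transformer itself adds to its output
        self.output_schema = pa.DataFrameSchema({new_col: pa.Column()}, strict=False)

    def transform_schema(self, schema: pa.DataFrameSchema) -> pa.DataFrameSchema:
        # the incoming schema plus the appended column, so pass-through columns are preserved
        return schema.add_columns({self.new_col: pa.Column()})

incoming = pa.DataFrameSchema({"some_other_col": pa.Column()}, strict=True)
transformed = AppendColumnTransformer("new_col").transform_schema(incoming)
# transformed now describes both "some_other_col" and "new_col", still with strict=True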


re: the concerns with schema_2.example(0), would you be able to generate a dataframe with size > 0 so that it contains some of the context?

I think you could, but I think it'd be quite messy and would mean you'd be generating a variety of different length DataFrames at each step, which seems odd to me.


Are you thinking that this would actually analyze the columns, types, and constraints of other_schema and return True if other_schema is a "subtype" of schema? (by "subtype" I mean all data generated by other_schema is valid under constraints in schema)

Yes, I think that summarises my thoughts well. Some of the checks might be things like:

  1. If schema.strict: True, then other_schema.strict: True (so that data conforming to other_schema can't contain extra columns)
  2. If schema.strict: "filter", then either other_schema.strict: False, or other_schema contains at least all of the necessary columns
  3. That any provided column dtypes match
  4. etc.

P.S. Thank you for taking the time to think this through and discuss - I realise now I could have provided better context in my original issue body

edit: also, if you were to accept this proposal, I'd be happy to help out/do the work on it

cosmicBboy commented 2 years ago

Cool, you have my support on this feature, it seems super useful!

However, I think this proposal is still under-specified; I'd be happy to help flesh it out further. Just to be clear, I'd estimate this to be a fairly heavy lift, so I appreciate your willingness to contribute to this feature!

I'd start off with a proof of concept implemented in a gist or a WIP branch, but basically what we're talking about here is not exactly validation of data, but instead analyzing type compatibility between schemas.

I'll leave it up to you what the API looks like for the POC, but I'd recommend implementing a function similar to issubclass that looks something like this:

def is_subtype(sub_schema: pa.DataFrameSchema, super_schema: pa.DataFrameSchema) -> bool:
    ...

Where is_subtype returns True if all data that's valid under sub_schema is also valid under super_schema, but not necessarily the other way around.

For example:

sub_schema = pa.DataFrameSchema({
    "x": pa.Column(int, pa.Check.in_range(0, 100)),
    "y": pa.Column(str, pa.Check.isin([*"abc"])),
})

super_schema = pa.DataFrameSchema({
    "x": pa.Column(int),
    "y": pa.Column(str),
})

assert is_subtype(sub_schema, super_schema)
assert not is_subtype(super_schema, sub_schema)

Clearly, sub_schema is a special case of super_schema, but not the other way around.
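Not a real implementation, but a minimal sketch of what the column/dtype-level part could look like (it deliberately punts on check-level analysis by treating any checks on the super-schema side as unprovable):

import pandera as pa

def is_subtype(sub_schema: pa.DataFrameSchema, super_schema: pa.DataFrameSchema) -> bool:
    # Conservatively return True only if all data valid under sub_schema is
    # provably valid under super_schema, using columns/dtypes/nullability only.
    for name, super_col in super_schema.columns.items():
        sub_col = sub_schema.columns.get(name)
        if sub_col is None:
            # super_schema requires a column that sub_schema doesn't guarantee
            if super_col.required:
                return False
            continue
        # strictest version: dtypes must match exactly
        if super_col.dtype is not None and sub_col.dtype != super_col.dtype:
            return False
        # a nullable sub column could produce nulls that a non-nullable super column rejects
        if sub_col.nullable and not super_col.nullable:
            return False
        # punt on check-level analysis for now: treat any super-side checks as unprovable
        if super_col.checks:
            return False
    if super_schema.strict is True:
        # sub_schema must not permit columns that super_schema would reject
        if sub_schema.strict is not True or not set(sub_schema.columns) <= set(super_schema.columns):
            return False
    return True

With the schemas above this already satisfies both assertions, but only because it refuses to prove anything about checks; the interesting part is the check-level analysis below.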

  1. If schema.strict: True, then other_schema.strict: True (so that data conforming to other_schema can't contain extra columns)
  2. If schema.strict: "filter", then either other_schema.strict: False, or other_schema contains at least all of the necessary columns
  3. That any provided column dtypes match
  4. etc.

So matching up many of the schema-level kwargs and column/index dtypes will be fairly straightforward, but the tricky part will be to analyze the compatibility between checks, for example:

sub_schema = pa.DataFrameSchema({
    "x": pa.Column(int, pa.Check.in_range(0, 100)),
    "y": pa.Column(str, pa.Check.isin([*"abc"])),
})

super_schema = pa.DataFrameSchema({
    "x": pa.Column(int, pa.Check.in_range(-100, 200)),
    "y": pa.Column(str, pa.Check.isin([*"abcde"])),
})

assert is_subtype(sub_schema, super_schema)
assert not is_subtype(super_schema, sub_schema)

The assertions here should still hold true, since the checks in super_schema are broader than the checks in sub_schema.
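For the registered built-in checks, one (hypothetical) approach could be to compare the statistics that pandera stores on Check objects, assuming the statistics keys match the builtin check argument names (min_value/max_value for in_range, allowed_values for isin):

import pandera as pa

def in_range_is_narrower(sub_check: pa.Check, super_check: pa.Check) -> bool:
    # Check.statistics holds the arguments of registered built-in checks
    sub, sup = sub_check.statistics, super_check.statistics
    return sub["min_value"] >= sup["min_value"] and sub["max_value"] <= sup["max_value"]

def isin_is_narrower(sub_check: pa.Check, super_check: pa.Check) -> bool:
    return set(sub_check.statistics["allowed_values"]) <= set(super_check.statistics["allowed_values"])

assert in_range_is_narrower(pa.Check.in_range(0, 100), pa.Check.in_range(-100, 200))
assert isin_is_narrower(pa.Check.isin([*"abc"]), pa.Check.isin([*"abcde"]))

A real version would need to dispatch on the check name and handle inclusive/exclusive bounds, but that's the general idea.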

This gets even trickier with custom checks, since these are effectively black boxes as far as pandera is concerned... so I'd suggest we parcel this out in 3 phases:

  1. Schema-level kwargs and column/index dtypes
  2. Built-in (registered) checks
  3. Custom checks
re: inline custom checks (i.e. non-registered checks), we need to come up with a strategy, e.g. simply raising an error when calling is_subtype on schemas with these checks, raising a suppressible user warning that pandera won't check for compatibility at the check-level, etc.

Anyway, please feel free to add more in the comments below, and link a WIP gist to get things going!

hsorsky commented 2 years ago

OK cool, thanks for writing all of that out; I agree with all of it. Sounds like a good plan of action. I'll post something when I've got something non-trivial and then we can build from there.

hsorsky commented 2 years ago

One thing I've noticed as I work on this PoC is that the answer isn't just yes/no, but yes, no, and (often) maybe. For example, if sub_schema is non-strict and has no explicitly provided columns, and super_schema has a required column "col", then whether sub_schema conforms to super_schema is 'maybe' (i.e. it depends on your interpretation of conformity/validity in the case of two schemas). A dataframe generated by sub_schema.example would conform to super_schema, but a dataframe that sub_schema considers valid may not be considered valid by super_schema. As such, I think we'd either need to decide more strictly what counts as conformity, which is IMO bad because it forces the user into our definition, or let the user decide on those 'maybe' cases, which is IMO good because it hands them the power of decision.

To start, I lean towards treating those cases as yes (True) cases (i.e. being generous about what we consider to be conformity, which I guess is making a temporary decision on the matter), but I'd like to consider adding a third return value of "maybe" (or potentially even more, depending on whether we want to handle different types of potential conformity differently) so that the user has more control over whether potential conformity is acceptable.
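One way that could look (the check_conformity function here is hypothetical):

import enum

class Conformity(enum.Enum):
    YES = "yes"      # all data valid under sub_schema is valid under super_schema
    NO = "no"        # some data valid under sub_schema is definitely invalid under super_schema
    MAYBE = "maybe"  # depends on interpretation, e.g. non-strict sub_schema vs. a required super column

# the caller decides how to treat MAYBE, e.g.:
# result = check_conformity(sub_schema, super_schema)  # hypothetical
# ok = result in {Conformity.YES, Conformity.MAYBE}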

hsorsky commented 2 years ago

@cosmicBboy, do you know if pandera currently supports checking whether a pa.DataType can be cast to a different one, without explicitly creating an object of the first type, trying to convert it to the second, and seeing if it works?

cosmicBboy commented 1 year ago

hey @hsorsky no, currently there's no way to do this. I'd suggest implementing the strictest version of this (dtypes have to match exactly) and we can extend this once we have an is_castable method (or something like it), which I'd consider a separate task.