hsorsky opened this issue 2 years ago
Hi @hsorsky, the solution you describe with schema_1.validate(schema_2.example(0)) is actually the way I'd recommend doing it. Doing extra magic like schema_1.validate(schema_2) seems confusing to me... it looks like I'm trying to validate schema_2 itself instead of the data that it generates.
Are you using this in the context of unit tests? If so, I'd recommend using the hypothesis library instead of .example(): https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html#usage-in-unit-tests.
Example usecase would be to check that two transformers are valid to be performed one after another
By transformer do you mean a function?
Let me know if I'm missing something, sometimes example code helps me to grok the intent behind the feature request.
My specific usecase is in building machine learning pipelines that consist of dataframe transformers (i.e. dataframe_transformed = transformer.transform(dataframe)). Due to the compute-intensive nature of these pipelines, we want to be able to validate that all of the components "fit together" and can be used one after the other before we actually run any data through them. Our solution was to attach schemas to each transformer defining:
- the input the transformer expects (input_schema)
- the output the transformer produces (output_schema)
and then check that all of the schemas fit together.
When it comes to validating a whole pipeline, we start with a schema defining the structure of the data that will be fed into said pipeline, and then it is run through the validation, checking that each component fits with the next. As part of this, one needs to check that the output of step i is valid input to step i + 1. The most natural way to do this, IMO, would be via something like schema_1.validate(schema_2), i.e. checking that data conforming to schema_2 would conform to schema_1. As pseudo code this might look like:
schema = ... # some schema that we derive from data, or get as output from some data fetching pipeline, etc.
for transformer in transformers:
transformer.input_schema.validate(schema)
schema = transformer.transform_schema(schema)
transformer.output_schema.validate(schema)
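The loop above can be made concrete with a toy stand-in, using plain dicts mapping column name to dtype string in place of pa.DataFrameSchema. All class and method names here are illustrative, not pandera API:

```python
# Toy model of the pipeline-validation loop above. Schemas are plain
# dicts (column name -> dtype string) standing in for pa.DataFrameSchema.

class AddColumnTransformer:
    """Hypothetical transformer that appends one new column."""

    def __init__(self, required_cols, new_col):
        # columns the transformer needs, and the column it adds
        self.input_schema = {c: "float64" for c in required_cols}
        self.output_schema = {new_col: "float64"}

    def validate_input(self, schema):
        # every required input column must be present with a matching dtype
        for col, dtype in self.input_schema.items():
            if schema.get(col) != dtype:
                raise TypeError(f"missing or incompatible column: {col}")

    def transform_schema(self, schema):
        # output schema PLUS everything passed through from the input
        return {**schema, **self.output_schema}


def validate_pipeline(schema, transformers):
    """Propagate a schema through the pipeline without touching any data."""
    for t in transformers:
        t.validate_input(schema)
        schema = t.transform_schema(schema)
    return schema
```

The point of the sketch is that the whole pipeline is checked for compatibility up front, before any (possibly expensive) data ever flows through it.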
The issue I can see with relying on schema_1.validate(schema_2.example(0)) is that schema_2.example(0) would generate a DataFrame that loses a lot of context, e.g. which columns are nullable and which aren't. I think by checking schemas against one another, you can ensure that such context agrees between the schemas.
Cool, this use case makes sense, though I have an additional question:
In the code example above, how is the output of transformer.transform_schema(schema) different from transformer.output_schema? Are they meant to be the same? Is the intent that transform_schema implements schema transformations that are validated with output_schema.validate(schema)?
re: the concerns with schema_2.example(0), would you be able to generate a dataframe with size > 0 so that it contains some of the context?
schema.validate(other_schema)
I think I'm still missing something. Suppose we implement schema.validate(other_schema)... it's unclear to me what this operation would actually be doing under the hood besides schema.validate(other_schema.example(0)).
Are you thinking that this would actually analyze the columns, types, and constraints of other_schema and return True if other_schema is a "subtype" of schema? (by "subtype" I mean all data generated by other_schema is valid under constraints in schema)
In the code example above, how is the output of transformer.transform_schema(schema) different from transformer.output_schema? Are they meant to be the same? Is the intent that transform_schema implements schema transformations that are validated with output_schema.validate(schema)?
Apologies, I wasn't very clear in my example. I'm imagining input_schema and output_schema to be attributes/properties of type pa.DataFrameSchema. transform_schema, however, would be a method with a signature like
def transform_schema(self, schema: pa.DataFrameSchema) -> pa.DataFrameSchema:
that "transforms" the schema itself.
The need for this would be, for example, if you had a transformer that appends a column to a DataFrame. In this case, the output_schema would just be something like
pa.DataFrameSchema({new_col: pa.Column()}, strict=False)
But what if the DataFrame that's expected to come into the transformer in a given instance will also have some other columns, e.g. has the schema
pa.DataFrameSchema({some_other_col: pa.Column()}, strict=True)
then upon being transformed, the DataFrame would now have both columns there, and so the transformed schema should reflect that, i.e. be
pa.DataFrameSchema({some_other_col: pa.Column(), new_col: pa.Column()}, strict=True)
i.e. it should reflect the output schema PLUS anything that is passed through from input.
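That "output schema PLUS passthrough" rule can be sketched in plain Python, with dicts of column name to dtype standing in for pa.DataFrameSchema (the function name is illustrative, not pandera API):

```python
def transform_schema(input_schema: dict, output_schema: dict) -> dict:
    """Return the transformed schema: every column passed through from
    the input, plus (or overridden by) the transformer's own outputs."""
    return {**input_schema, **output_schema}
```

For the example above, transform_schema({"some_other_col": ...}, {"new_col": ...}) yields a schema containing both columns, mirroring what the transformer does to the data itself.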
re: the concerns with schema_2.example(0), would you be able to generate a dataframe with size > 0 so that it contains some of the context?
I think you could, but it'd be quite messy and would mean generating a variety of different-length DataFrames at each step, which seems odd to me.
Are you thinking that this would actually analyze the columns, types, and constraints of other_schema and return True if other_schema is a "subtype" of schema? (by "subtype" I mean all data generated by other_schema is valid under constraints in schema)
Yes, I think that summarises my thoughts well. Some of the checks might be things like:
- If schema.strict: True then other_schema.strict: True (so that data conforming to other_schema can't contain extra columns)
- If schema.strict: "filter" then either:
  - other_schema.strict: False; or
  - other_schema contains at least all of the necessary columns
- That any provided column dtypes match
- etc

P.S. Thank you for taking the time to think this through and discuss - I realise now I could have provided better context in my original issue body
edit: also, if you were to accept this proposal, I'd be happy to help out/do the work on it
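The strict-mode rules above can be sketched as a small predicate over a simplified schema model (a strict flag plus a set of column names; none of this is pandera's actual API):

```python
def strict_compatible(schema_strict, other_strict, required_cols, other_cols):
    """Check whether other_schema's strictness is compatible with schema's.

    schema_strict / other_strict: True, False, or "filter"
    required_cols / other_cols: sets of column names declared by each schema
    """
    if schema_strict is True:
        # other_schema must also be strict, so data conforming to it
        # can't carry extra columns that schema would reject
        return other_strict is True
    if schema_strict == "filter":
        # extras get filtered away, so either other_schema is non-strict,
        # or it already lists every column schema requires
        return other_strict is False or required_cols <= other_cols
    # a non-strict schema tolerates anything extra
    return True
```

This only covers the strictness dimension; dtype and check compatibility would be separate predicates combined on top.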
Cool, you have my support on this feature, it seems super useful!
However I think this proposal is still under-specified, but I'd be happy to help flesh it out further. Just to be clear, I'd estimate this to be a fairly heavy lift, so appreciate your willingness to contribute to this feature!
I'd start off with a proof of concept implemented in a gist or a WIP branch, but basically what we're talking about here is not exactly validation of data, but instead analyzing type compatibility between schemas.
I'll leave it up to you what the API looks like for the POC, but I'd recommend implementing a function similar to issubclass that looks something like this:
def is_subtype(sub_schema: pa.DataFrameSchema, super_schema: pa.DataFrameSchema) -> bool:
...
Where is_subtype returns True if all data that's valid under sub_schema is also valid under super_schema, but not necessarily the other way around.
For example:
sub_schema = pa.DataFrameSchema({
"x": pa.Column(int, pa.Check.in_range(0, 100)),
"y": pa.Column(str, pa.Check.isin([*"abc"])),
})
super_schema = pa.DataFrameSchema({
"x": pa.Column(int),
"y": pa.Column(str),
})
assert is_subtype(sub_schema, super_schema)
assert not is_subtype(super_schema, sub_schema)
Clearly, sub_schema is a special case of super_schema, but not the other way around.
If schema.strict: True then other_schema.strict: True (so that data conforming to other_schema can't contain extra columns) If schema.strict: "filter" then either: other_schema.strict: False; or other_schema contains at least all of the necessary columns That any provided column dtypes match etc
So matching up many of the schema-level kwargs and column/index dtypes will be fairly straightforward, but the tricky part will be to analyze the compatibility between checks, for example:
sub_schema = pa.DataFrameSchema({
"x": pa.Column(int, pa.Check.in_range(0, 100)),
"y": pa.Column(str, pa.Check.isin([*"abc"])),
})
super_schema = pa.DataFrameSchema({
"x": pa.Column(int, pa.Check.in_range(-100, 200)),
"y": pa.Column(str, pa.Check.isin([*"abcde"])),
})
assert is_subtype(sub_schema, super_schema)
assert not is_subtype(super_schema, sub_schema)
The assertions here should still hold true, since the checks in super_schema are broader than the checks in sub_schema.
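The check-level containment for built-ins like in_range and isin could be sketched like this (a hypothetical helper over tuple-encoded checks, not pandera's actual check machinery):

```python
def check_implies(sub_check, super_check):
    """True if every value passing sub_check also passes super_check.

    Checks are modelled as tuples: ("in_range", lo, hi) or ("isin", values).
    """
    kind = sub_check[0]
    if kind != super_check[0]:
        # cross-kind analysis (e.g. in_range vs isin) not attempted here
        return False
    if kind == "in_range":
        _, sub_lo, sub_hi = sub_check
        _, sup_lo, sup_hi = super_check
        # sub's interval must sit entirely inside super's interval
        return sup_lo <= sub_lo and sub_hi <= sup_hi
    if kind == "isin":
        # sub's allowed values must be a subset of super's
        return set(sub_check[1]) <= set(super_check[1])
    return False
```

For the schemas above, check_implies(("in_range", 0, 100), ("in_range", -100, 200)) holds but not the reverse, matching the is_subtype assertions.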
This gets even trickier with custom checks, since these are effectively black boxes as far as pandera is concerned... so I'd suggest we parcel this out in 3 phases:
1. Support the built-in checks (pa.Check.isin, pa.Check.gt, etc).
2. Support registered custom checks that expose statistics and can be analyzed by pandera. This will likely need an additional abstraction in order to determine the relationship between two sets of statistics.
3. re: inline custom checks (i.e. non-registered checks), we need to come up with a strategy, e.g. simply raising an error when calling is_subtype on schemas with these checks, or raising a suppressible user warning that pandera won't check for compatibility at the check level, etc.
Anyway, please feel free to add more in the comments below, and link a WIP gist to get things going!
OK cool, thanks for writing all of that out. I agree with all. Sounds like a good plan of action. I'll post something when I've got something non-trivial and then we can build from there.
One thing I've noticed as I work on this PoC is that the answer isn't just yes/no, but yes, no, and (often) maybe. E.g. if sub_schema is non-strict and has no explicitly provided columns and super_schema has required column "col", whether sub_schema conforms to super_schema is 'maybe' (i.e. it would depend on your interpretation of conformity/validity in the case of 2 schemas). In terms of a dataframe generated by sub_schema.example, said dataframe would conform to super_schema, but in terms of a dataframe that sub_schema would consider to be valid, super_schema may not consider it valid. As such, I think we'd either need to decide more strictly what is considered conformity, which is IMO bad because it forces the user into our definition, or let the user decide on those 'maybe' cases, which is IMO good because it hands them the power of decision.
To start, I lean towards treating those cases as yes (True) cases (i.e. being generous about what we consider to be conformity, which I guess is making a, temporary, decision on the matter), but I'd like to consider potentially adding a third return value of "maybe" (or potentially even more, depending on whether we want to handle different types of potential conformity differently) so that the user has more control over whether potential conformity is acceptable or not.
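One way to expose that third outcome is a small tri-state result; the following is purely illustrative (the enum, function, and decision rule are all hypothetical, not part of pandera):

```python
import enum

class Conformity(enum.Enum):
    YES = "yes"      # all data valid under sub is valid under super
    NO = "no"        # some data valid under sub is invalid under super
    MAYBE = "maybe"  # depends on the chosen definition of conformity

def conforms(sub_cols, sub_strict, super_required):
    """Toy decision modelling the example above: a non-strict sub schema
    missing a required column can't be decided either way."""
    if super_required <= sub_cols:
        return Conformity.YES
    if sub_strict:
        # sub pins its columns down exactly, so the required column
        # genuinely cannot appear in conforming data
        return Conformity.NO
    return Conformity.MAYBE
```

The caller can then collapse MAYBE to YES (the generous default suggested above) or to NO, keeping the decision in the user's hands.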
@cosmicBboy, do you know if pandera currently supports checking whether a pa.DataType can be cast to a different one without explicitly creating an object of the first type and trying to convert it to the second to see if it works?
hey @hsorsky no, currently there's no way to do this. I'd suggest implementing the strictest version of this (dtypes have to match exactly) and we can extend it once we have an is_castable method (or something like it), which I'd consider a separate task.
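For the eventual is_castable helper, numpy already exposes safe-cast analysis via np.can_cast, which could serve as a starting point for numpy-backed dtypes (a sketch; pandera's own dtype layer would still need to be mapped onto numpy dtypes, and non-numpy engines handled separately):

```python
import numpy as np

def is_castable(from_dtype: str, to_dtype: str) -> bool:
    """True if every value of from_dtype survives conversion to to_dtype
    under numpy's "safe" casting rules."""
    return np.can_cast(np.dtype(from_dtype), np.dtype(to_dtype), casting="safe")
```

Under "safe" casting, widening conversions like int32 -> int64 pass while narrowing ones like float64 -> int64 do not, which lines up with the subtype intuition above.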
Is your feature request related to a problem? Please describe.
Provide a way for a schema to validate against another schema (e.g. to check that they're compatible). Example usecase would be to check that two transformers are valid to be performed one after another (output schema of transformer n is compatible with input schema of transformer n + 1).

Describe the solution you'd like
Similar to how we can do schema.validate(df), it'd be nice to be able to do schema_1.validate(schema_2)

Describe alternatives you've considered
A hacky way to get it done would be to use schema_2.example to generate fake data to then feed into schema_1.validate, i.e. schema_1.validate(schema_2.example(0)), but this feels very sub-optimal