Jumping off of this comment: https://github.com/pandera-dev/pandera/issues/382#issuecomment-759025907
To decompose this issue into 2 problems:

1. Should we have keyword validation options in the `DataFrameSchema` constructor?
2. How to best support built-in checks in the `SchemaModel` API?

**Should we have keyword validation options in the `DataFrameSchema` constructor?**

> One issue with 2. is that it could bloat the api, especially SchemaModel. Afterwards we could imagine other checks that could be done dataframe-wide, where do we draw the line?
I think the slope isn't so slippery here. On a case-by-case basis we can determine whether a validation check is so fundamental that it should be a keyword option.
For example, the `pandas_dtype` and `nullable` options in `SeriesSchemaBase` could be implemented as a `Check`; however, they are intertwined with how the validation routine is implemented when it comes to type-coercion. The built-in `Check`s, on the other hand, are self-contained and don't really interact with any other part of the validation process besides checking the contents of the dataframe/series.
Similarly, the minimum number of rows in a dataframe (i) is a common use case, (ii) determines whether the validated dataframe can be empty (`min_rows = 0` means it can, which ties nicely into https://github.com/pandera-dev/pandera/issues/332), and (iii) interacts with the data synthesis strategy, because the `size` argument would need to be at least `min_rows`. I think this is enough to justify `min_rows` as a keyword option in the `DataFrameSchema` constructor.
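A hypothetical sketch of what that could look like (`min_rows` is the keyword being proposed here, not an existing `DataFrameSchema` argument):

```python
import pandera as pa

# hypothetical: min_rows as a schema-level keyword option (proposed above)
schema = pa.DataFrameSchema(
    {"col": pa.Column(pa.Int)},
    min_rows=1,  # validated dataframes must be non-empty
)

# the data synthesis strategy would need to respect it, e.g.
# schema.example(size=5)  # size would have to be >= min_rows
```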
**How to best support built-in checks in the `SchemaModel` API?**

That said, we should also support built-in checks at the dataframe level. I believe all of the built-in `Check`s currently implemented support both a dataframe and a column/series input (we will need to refactor the types, e.g. https://github.com/pandera-dev/pandera/blob/master/pandera/checks.py#L460, which are misleading). The new extensions module introduced the concept of `supported_types` for checks, which should probably be folded into the `Check` class.
```python
pa.DataFrameSchema(..., checks=pa.Check.NotEmpty())
```

```python
class Schema(pa.SchemaModel):
    class Config:
        checks = [pa.Check.NotEmpty()]  # new
```
I think this solution works; my concern here is that there would then be a discrepancy between how built-in checks are expressed in `Field`s and how they're expressed at the dataframe level.
```python
Field(gt=100)
# vs
Field(checks=[pa.Check.gt(100)])
```
I wonder if we could address this by doing something like:
```python
class Schema(pa.SchemaModel):
    class Config:
        # support built-in checks as config attributes
        gt = 0
        lt = 100
```
> Should we have keyword validation options in the `DataFrameSchema` constructor?

I agree with all your points about adding `min_rows` :+1:
> How to best support built-in checks in the `SchemaModel` API?

Built-in checks as config attributes are indeed more consistent. We should interpret unknown `Config` attributes as registered checks; at the moment, unknown attributes are ignored.
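A rough sketch of that interpretation rule (illustrative only, not pandera's actual implementation; `collect_config_checks` and `known_options` are made-up names):

```python
import pandera as pa

# Config attributes that are real configuration options, not checks
known_options = {"name", "coerce", "strict"}  # etc.

def collect_config_checks(config_cls):
    """Treat any unknown Config attribute as a registered/built-in check."""
    checks = []
    for name, value in vars(config_cls).items():
        if name.startswith("_") or name in known_options:
            continue
        check_factory = getattr(pa.Check, name)  # raises if no such check exists
        checks.append(check_factory(value))
    return checks
```

(Handling of multi-parameter checks via dicts/tuples is discussed further down.)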
Side note: we should encourage this style if we add more attributes:
```python
import pandera as pa
from pandera.model import BaseConfig

class Schema(pa.SchemaModel):
    class Config(BaseConfig):  # IDE can help auto-complete valid attributes
        gt = 0
        lt = 100
```
Are there plans for making global checks serializable? I'd like to use `pandera` as a data definition on disk of sorts. Right now, global checks are just dropped when calling `to_yaml`. It would be nice to save them as keywords and look them up from the extension API, but as far as I know, there is no way to do this.
Hey @antonl, I had to double-check, but any check registered via `extensions.register_check_method` should be serializable as a yaml file or python script; see this post I just made: https://github.com/pandera-dev/pandera/discussions/417.
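For reference, a minimal sketch of the round trip (the check name `col_is_at_least` is just an example):

```python
import pandera as pa
import pandera.extensions as extensions

# register a custom check; `statistics` tells pandera which keyword
# arguments to serialize alongside the check's registered name
@extensions.register_check_method(statistics=["min_value"])
def col_is_at_least(pandas_obj, *, min_value):
    return pandas_obj >= min_value

schema = pa.DataFrameSchema(
    {"col": pa.Column(pa.Int, pa.Check.col_is_at_least(min_value=0))}
)

yaml_str = schema.to_yaml()  # the registered check serializes by name + statistics
schema_roundtrip = pa.DataFrameSchema.from_yaml(yaml_str)
```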
I did find a bug in the current serialization implementation that errors out when providing `pandas_dtype` arguments that are not of type `PandasDtype`, so for now you'll have to use e.g. `pa.Int`, `pa.Float`, `pa.String`, etc. instead of python built-in types, pandas type string aliases, or numpy types. Will make an issue and fix for this.
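For example (a sketch of the workaround described above):

```python
import pandera as pa

# use pandera's dtype objects so the schema serializes cleanly
schema = pa.DataFrameSchema({"x": pa.Column(pa.Int)})
# schema = pa.DataFrameSchema({"x": pa.Column(int)})  # currently errors in to_yaml
schema.to_yaml()
```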
Hi @cosmicBboy! Thank you for your quick reply, I really like the library! I've posted a test case to demonstrate what I meant in the discussion you've linked.
@jeffzi @cosmicBboy I'm interested in working on this.
How would you support checks with multiple parameters?
```python
import pandera as pa
from pandera.model import BaseConfig

class Schema(pa.SchemaModel):
    class Config(BaseConfig):  # IDE can help auto-complete valid attributes
        # zero_stat_check ?
        one_stat_check = 0
        # two_stat_check = ("a", 2)? {"first": "a", "second": 2}?
```
Thanks @antonl, work on this would be much appreciated! Currently the way these are handled for `Field` checks is with a dict:
```python
class Schema(pa.SchemaModel):
    col: Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 100})
```
So the `{"first": "a", "second": 2}` option makes sense, although the `tuple` idea also makes sense and is potentially more concise in some cases; maybe that syntax should be supported too (thoughts @jeffzi?).
Another thought I had about this issue: I wonder whether built-in dataframe checks make sense to specify in `Config`. This may be over-engineering it, but semantically it would make more sense to put them in a different class, where `Config` would be reserved for configuration options.
```python
class Schema(pa.SchemaModel):
    class Checks(GlobalChecks):  # or something 🤷‍♂️
        eq = 0
        ...
```
> ...the tuple idea also makes sense and potentially more concise in some cases

Agreed, we could interpret a tuple as positional arguments and a dict as keyword arguments.
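A sketch of how that interpretation might look in `Config` (the behavior is proposed, not implemented; `in_range` is an existing built-in check):

```python
import pandera as pa

class Schema(pa.SchemaModel):
    class Config:
        # a dict would map to keyword arguments...
        in_range = {"min_value": 0, "max_value": 100}
        # ...while a tuple would map to positional arguments:
        # in_range = (0, 100)
```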
Regarding DataFrame-level checks (global checks): I'm reluctant to add another inner class; that's not very pythonic (as far as I know). `Config` was inspired by `pydantic` to increase adoption, and I couldn't find a better alternative.
`Config` mirrors `DataFrameSchema` arguments. Adding built-in checks would be the exception, but it's a similar concept to built-in checks in `Field`, so it should be intuitive.
@antonl Your help is appreciated indeed :tada: Any thoughts about the inner class idea?
Regarding the inner class idea, I'm not a fan. Most Python people don't even know you can declare classes like this.

How about keeping the same field syntax for cols/index and making a class decorator to enable this sugar:
```python
@with_checks(pa.Check.min_rows(5), ...)
class Schema(pa.SchemaModel):
    col: Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 100})
```
The implementation would just attach a `_checks`/`checks` field to the type. Subclassing would just append to the field if it exists (a sketch of this is below).
You could even enable something like this:
```python
@with_checks(min_rows={"count": 5})
class Schema(pa.SchemaModel):
    ...
```
That maintains the feel of Field checks as desired.
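A minimal sketch of how such a decorator could work (hypothetical, not pandera API; `__dataframe_checks__` is a made-up attribute name):

```python
import pandera as pa

def with_checks(*checks, **builtin_checks):
    """Hypothetical decorator: attach dataframe-level checks to a SchemaModel."""
    def decorator(cls):
        # append to checks inherited from a parent model so subclassing composes
        collected = list(getattr(cls, "__dataframe_checks__", []))
        collected.extend(checks)
        # interpret keyword arguments as built-in/registered checks, mirroring Field
        for name, kwargs in builtin_checks.items():
            collected.append(getattr(pa.Check, name)(**kwargs))
        cls.__dataframe_checks__ = collected
        return cls
    return decorator
```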
I think for consistency of the `SchemaModel` interface, I'd like to keep two requirements:

1. Users shouldn't have to handle `Check` objects directly, because checks are either built-in or expressed as methods decorated with `@check` or `@dataframe_check`.
2. The supported ways of expressing checks are `Field(<builtin_check>=...)`, `@check` methods, and registering custom checks with the extensions API and then using them as built-ins in `Field`.

> Regarding DataFrame-level checks (global checks). I'm reluctant to add another inner class.

> Regarding the inner class idea, I'm not a fan.
I do agree, but I figured it was worth considering 🤔
> That's not very pythonic (as far as I know).

As an aside, my understanding of what is "pythonic" has changed so many times over the years I don't really even know what it means anymore 😅
Anyway, I'm inclined to stick with defining built-in dataframe checks in `Config` with the same behavior as `Field`: built-in and registered checks can be specified as attributes, while custom checks are expressed as `@dataframe_check` methods.
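A sketch of that split (the `gt`/`lt` attributes follow the Config proposal above; the custom check method is illustrative):

```python
import pandas as pd
import pandera as pa
from pandera.typing import Series

class Schema(pa.SchemaModel):
    col: Series[int]

    class Config:
        # built-in (or registered) dataframe-level checks as attributes
        gt = 0
        lt = 100

    @pa.dataframe_check
    def no_all_null_rows(cls, df: pd.DataFrame) -> Series[bool]:
        # one-off custom check stays a decorated method
        return ~df.isna().all(axis=1)
```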
Let me know if #478 looks acceptable, @cosmicBboy and @jeffzi! It's a little weird that we have to specify stuff on the `Config` class implicitly, but it is easy enough to implement.
We should perhaps have a "best practices" section in the docs that highlights the difference between methods decorated with `dataframe_check` and checks declared through `Config`. Do we recommend one or the other?
Best practices and cookbook sections would be great.
> It's a little weird that we have to specify stuff on the Config class implicitly,

I kind of agree, but `Field` can also implicitly swallow custom checks. Perhaps it feels less natural with the inner class `Config` because this mechanism is uncommon.
An alternative to the inner class could be a special attribute `__config__` that must be assigned a `Config` instance:
```python
class Base(pa.SchemaModel):
    a: Series[int]

    __config__ = Config(
        name="Base schema",
        coerce=True,
        ordered=True,
        multiindex_coerce=True,
        multiindex_strict=True,
        multiindex_name="mi",
        custom_check=(1, 2),
    )
```
Just testing the water; I can open a discussion if either of you @antonl @cosmicBboy thinks something like that is worth investigating.
> We should perhaps have a "best practices" section in the docs that highlights the difference between methods decorated with `dataframe_check` and checks declared through `Config`. Do we recommend one or the other?
Imho, the `@check` and `@dataframe_check` decorators are for defining one-offs, similar to a `Check` defined with a lambda function in the `DataFrameSchema` api, or at least checks specific to a given model. Registered checks are for reusable checks.
Sorry I wasn't clear, @jeffzi. I think the `Config` class is fine. I was just concerned that there are two ways to add dataframe checks (using methods and adding keys to `Config`), and they look totally different.

Also, the two ways have trade-offs: one is a one-off but prevents serialization, while the other is global.
fixed by #478
**Is your feature request related to a problem? Please describe.**

Discussion in https://github.com/pandera-dev/pandera/issues/382 has led to the need to support validation checks at the dataframe level. We'll need to discuss further to converge on the ideal API for this, keeping in mind that we need to optimize for (a) a good UX for the object-based and class-based APIs and (b) support for data synthesis strategies, as all schemas should be able to produce data synthesis strategies that generate valid data.