unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Support built-in dataframe checks and dataframe-level options #383

Closed cosmicBboy closed 3 years ago

cosmicBboy commented 3 years ago

Is your feature request related to a problem? Please describe.

Discussion in https://github.com/pandera-dev/pandera/issues/382 has led to the need to support validation checks at the dataframe level. We'll need to discuss further to converge on the ideal API for this, keeping in mind that we need to optimize for (a) a good UX for both the object-based and class-based APIs and (b) support for data synthesis strategies, since every schema should be able to produce a strategy that generates valid data.

cosmicBboy commented 3 years ago

Jumping off of this comment: https://github.com/pandera-dev/pandera/issues/382#issuecomment-759025907

To decompose this issue into 2 problems:

Should we have keyword validation options in the DataFrameSchema constructor?

One issue with 2. is that it could bloat the API, especially SchemaModel. Beyond that, we could imagine other checks that could be done dataframe-wide; where do we draw the line?

I think the slope isn't so slippery here. On a case-by-case basis we can determine whether a validation check is so fundamental that it should be a keyword option.

For example, the pandas_dtype and nullable options in SeriesSchemaBase could be implemented as Checks, but they are intertwined with how the validation routine handles type coercion. The built-in Checks, on the other hand, are self-contained and don't really interact with any other part of the validation process besides checking the contents of the dataframe/series.

Similarly, the min number of rows in a dataframe is (i) a common use case, (ii) determines whether the validated dataframe can be empty (min_rows = 0 means it can, which ties nicely into https://github.com/pandera-dev/pandera/issues/332), and (iii) interacts with the data synthesis strategy because the size argument would need to be at least min_rows. I think this is enough to justify min_rows as a keyword option in the DataFrameSchema constructor.
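
As a point of reference, here is a minimal sketch of how a minimum-row requirement can be expressed today with a dataframe-level Check; the proposed min_rows keyword would make this a first-class option that also bounds the data synthesis size argument:

import pandas as pd
import pandera as pa

# Sketch only: approximate min_rows with a dataframe-level Check.
schema = pa.DataFrameSchema(
    columns={"col": pa.Column(pa.Int)},
    checks=pa.Check(lambda df: len(df) >= 1, name="min_rows"),
)

schema.validate(pd.DataFrame({"col": [1, 2, 3]}))  # passes
# an empty dataframe would fail the "min_rows" check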

How to best support built-in checks in the SchemaModel API?

That said, we should also support built-in checks at the dataframe level. I believe all of the built-in Checks currently implemented support both a dataframe and column/series input (we'll need to refactor the type annotations, e.g. https://github.com/pandera-dev/pandera/blob/master/pandera/checks.py#L460, which are misleading). The new extensions module introduced the concept of supported_types for checks, which should probably be folded into the Check class.

pa.DataFrameSchema(..., checks = pa.Check.NotEmpty())

class Schema(pa.SchemaModel):
    class Config:
        checks = [pa.Check.NotEmpty()] # new

I think this solution works; my concern is that there would then be a discrepancy between how built-in checks are expressed in Fields and how they're expressed at the dataframe level.

Field(gt=100)

# vs

Field(checks=[pa.Check.gt(100)])

I wonder if we could follow this by doing something like:

class Schema(pa.SchemaModel):
    class Config:
        # support built-in checks as config attributes
        gt = 0
        lt = 100

jeffzi commented 3 years ago

Should we have keyword validation options in the DataFrameSchema constructor?

I agree with all your points about adding min_rows :+1:

How to best support built-in checks in the SchemaModel API?

Expressing built-in checks as Config attributes is indeed more consistent. We should interpret unknown Config attributes as registered checks; at the moment, unknown attributes are ignored.

Side note, we should encourage this style if we add more attributes:

import pandera as pa
from pandera.model import BaseConfig

class Schema(pa.SchemaModel): 
    class Config(BaseConfig): # IDE can help auto-complete valid attributes
        gt = 0
        lt = 100

antonl commented 3 years ago

Are there plans for making global checks serializable? I'd like to use pandera as a data definition on disk of sorts. Right now, global checks are just dropped when calling to_yaml. It would be nice to save them as keywords and look up from the extension API, but as far as I know, there is no way to do this.

cosmicBboy commented 3 years ago

Hey @antonl, I had to double check, but any check registered via extensions.register_check_method should be serializable as a yaml file or python script; see this post I just made: https://github.com/pandera-dev/pandera/discussions/417.

I did find a bug in the current serialization implementation that errors out when providing pandas_dtype arguments that are not of type PandasDtype, so for now you'll have to use e.g. pa.Int, pa.Float, pa.String, etc. instead of python built-in types, pandas type string aliases, or numpy types. I'll open an issue and a fix for this.
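
For illustration, a minimal sketch of the pattern described above (the at_least check and its statistic are made up for the example): register a check with named statistics via the extensions API, use it as a built-in, and serialize the schema.

import pandera as pa
import pandera.extensions as extensions

# Registered checks become available as pa.Check.<name>; their named
# statistics are what the yaml serialization records.
@extensions.register_check_method(statistics=["min_value"])
def at_least(pandas_obj, *, min_value):
    return pandas_obj >= min_value

schema = pa.DataFrameSchema(
    {"col": pa.Column(pa.Int, checks=pa.Check.at_least(min_value=0))}
)
print(schema.to_yaml())  # the registered check is written out with its statistics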

antonl commented 3 years ago

Hi @cosmicBboy! Thank you for your quick reply, I really like the library! I've posted a test case to demonstrate what I meant in the discussion you've linked.

antonl commented 3 years ago

@jeffzi @cosmicBboy I'm interested in working on this.

How would you support checks with multiple parameters?

import pandera as pa
from pandera.model import BaseConfig

class Schema(pa.SchemaModel): 
    class Config(BaseConfig): # IDE can help auto-complete valid attributes
        # zero_stat_check ?
        one_stat_check = 0
        # two_stat_check = ("a", 2)? {"first": "a", "second": 2}?

cosmicBboy commented 3 years ago

thanks @antonl, work on this would be much appreciated!

Currently the way these are handled for Field checks is with a dict.

class Schema(pa.SchemaModel):
    col: Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 100})

So {"first": "a", "second": 2} makes sense. The tuple idea also makes sense and is potentially more concise in some cases, so maybe that syntax should be supported too (thoughts @jeffzi?).

Another thought I had about this issue: I wonder whether built-in dataframe checks really belong in Config. This may be over-engineering it, but semantically it might make more sense to put them in a different class and reserve Config for configuration options.

class Schema(pa.SchemaModel):
    class Checks(GlobalChecks):  # or something 🤷‍♂️
        eq = 0
        ....

jeffzi commented 3 years ago

...the tuple idea also makes sense and is potentially more concise in some cases,

Agreed, we could interpret a tuple as positional arguments and dict as keyword arguments.
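
For concreteness, here's a rough sketch (not pandera internals) of how that interpretation could work when collecting checks from a Config class: attribute names that match a built-in or registered check are turned into Check objects, with tuples expanded as positional arguments and dicts as keyword arguments.

import pandera as pa

# Illustrative subset of reserved Config options; the real list lives in BaseConfig.
RESERVED_OPTIONS = {"name", "coerce", "strict", "ordered"}

def collect_config_checks(config_cls):
    """Map unknown Config attributes onto built-in/registered checks (sketch)."""
    checks = []
    for attr, value in vars(config_cls).items():
        if attr.startswith("_") or attr in RESERVED_OPTIONS:
            continue
        check_factory = getattr(pa.Check, attr, None)
        if check_factory is None:
            continue  # current behavior: unknown attributes are ignored
        if isinstance(value, dict):
            checks.append(check_factory(**value))  # dict -> keyword arguments
        elif isinstance(value, tuple):
            checks.append(check_factory(*value))   # tuple -> positional arguments
        else:
            checks.append(check_factory(value))    # scalar -> single argument
    return checks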

Regarding DataFrame-level checks (global checks). I'm reluctant to add another inner class. That's not very pythonic (as far as I know). Config was inspired by pydantic to increase adoption, and I couldn't find a better alternative.

Config mirrors DataFrameSchema arguments. Adding built-in checks would be the exception, but it's a similar concept to built-in checks in Field, so it should be intuitive.

@antonl Your help is appreciated indeed :tada: Any thoughts about the inner class idea?

antonl commented 3 years ago

Regarding the inner class idea, I'm not a fan. Most Python users don't even know you can declare nested classes like this.

How about keeping the same Field syntax for columns/index and adding a class decorator to enable this sugar:

@with_checks(pa.Check.min_rows(5), ...)
class Schema(pa.SchemaModel):
    col: Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 100})

The implementation would just attach a _checks/checks attribute to the class. Subclassing would just append to the attribute if it already exists (see the sketch at the end of this comment).

You could even enable something like this,

@with_checks(min_rows={"count": 5})
class Schema(pa.SchemaModel):
   ...

That maintains the feel of Field checks as desired.
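
For concreteness, a rough sketch of such a decorator; nothing here is existing pandera API, the __dataframe_checks__ attribute name is invented, and wiring the attached checks into validation is omitted:

import pandera as pa
from pandera.typing import Series

def with_checks(*checks: pa.Check):
    """Attach dataframe-level checks to a SchemaModel subclass (sketch)."""
    def decorator(cls):
        inherited = list(getattr(cls, "__dataframe_checks__", []))
        cls.__dataframe_checks__ = inherited + list(checks)  # subclasses append
        return cls
    return decorator

@with_checks(pa.Check(lambda df: len(df) >= 5, name="min_rows"))
class Schema(pa.SchemaModel):
    col: Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 100})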

cosmicBboy commented 3 years ago

I think for consistency of the SchemaModel interface, I'd like to keep two requirements:

  1. the user doesn't have to worry about Check objects because checks are either built-in or expressed as methods decorated with @check or @dataframe_check.
  2. minimize the surface area of how checks are expressed. Currently it's with Field(<builtin_check>=...), @check methods, and registering custom checks with the extensions API and then using them as built-ins in Field.

Regarding DataFrame-level checks (global checks). I'm reluctant to add another inner class.

Regarding the inner class idea, I'm not a fan

I do agree, but I figured it was worth considering 🤔

That's not very pythonic (as far as I know).

As an aside, my understanding of what is "pythonic" has changed so many times over the years I don't really even know what it means anymore 😅

Anyway, I'm inclined to stick with defining built-in dataframe checks in Config with the same behavior as Field: built-in and registered checks can be specified as attributes, while custom checks are expressed as @dataframe_check methods.
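
To make the distinction concrete, here's a sketch of how a model could combine the two: the Config attribute uses the proposed (not yet implemented) syntax with a hypothetical registered check named has_min_rows, while @pa.dataframe_check is the existing decorator for one-off custom checks.

import pandas as pd
import pandera as pa
from pandera.typing import Series

class Schema(pa.SchemaModel):
    col: Series[int]

    class Config:
        # proposed syntax: built-in/registered dataframe-level check as a
        # Config attribute (dict -> keyword arguments)
        has_min_rows = {"min_rows": 5}

    @pa.dataframe_check
    def col_total_is_positive(cls, df: pd.DataFrame) -> bool:
        # one-off check specific to this model
        return df["col"].sum() > 0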

antonl commented 3 years ago

Let me know if #478 looks acceptable, @cosmicBboy and @jeffzi! It's a little weird that we have to specify stuff on the Config class implicitly, but it is easy enough to implement.

We should perhaps have a "best practices" section in the docs that highlights the difference between checks defined with the dataframe_check decorator and checks defined through Config. Do we recommend one or the other?

jeffzi commented 3 years ago

Best practices and cookbook sections would be great.

It's a little weird that we have to specify stuff on the Config class implicitly,

I kind of agree, but Field also implicitly swallows custom checks. Perhaps it feels less natural with the inner Config class because this mechanism is uncommon.

An alternative to the inner class could be a special attribute __config__ that must be assigned a Config instance.


class Base(pa.SchemaModel):
    a: Series[int]
    __config__ = Config(
        name="Base schema",
        coerce=True,
        ordered=True,
        multiindex_coerce=True,
        multiindex_strict=True,
        multiindex_name="mi",
        custom_check=(1, 2)
    )

Just testing the water; I can open a discussion if either of you (@antonl, @cosmicBboy) thinks something like that is worth investigating.

We should perhaps have a "best practices" section in the docs that highlights the difference between checks defined with the dataframe_check decorator and checks defined through Config. Do we recommend one or the other?

Imho, the @check and @dataframe_check decorators are for defining one-offs (similar to a Check defined with a lambda function in the DataFrameSchema API), or at least checks specific to a given model. Registered checks are for reusable checks.

antonl commented 3 years ago

Sorry I wasn't clear, @jeffzi. I think the Config class is fine. I was just concerned that there are two ways to add dataframe checks (using methods and adding keys to Config), and they look totally different.

Also, the two ways have trade-offs: one is a one-off but prevents serialization, while the other is global.

cosmicBboy commented 3 years ago

fixed by #478