unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.27k stars 305 forks source link

Are Configs not inherited by design? #983

Closed kr-hansen closed 1 year ago

kr-hansen commented 1 year ago

It seems that Configs on SchemaModels are not inherited. Is this by design?

I've built some pandera schema models that inherit from one another, but it seems that pandera SchemaModels don't inherit the Config from one another. Is this by design or am I doing something wrong?

For example:

from pandera.typing import Series
import pandera as pa

class Model1(pa.SchemaModel):
    col1: Series[int]
    class Config:
        unique=['col1']

class Model2(Model1):
    col2: Series[str]

With the above SchemaModels defined, I would expect that Model2 would also have an attribute of Model2.Config.unique that was equal to ['col1'] but that doesn't seem to get inherited. For every inherited subclass is it expected to re-define the Config and re-inherit what is defined in the parent classes?

My current work around which works ok is

class Model2(Model1):
    col2: Series[str]
    Config = Model1.Config

However, I wasn't sure if there was additional metadata that gets initialized in a blank Config that I'm overwriting by doing this or if there would be a better practice to better inherit if there is a design reason to not have the config get inherited by subsequent schemas. This pattern just went against my assumptions in building out mixtures of pandera schemas for my data.

Note that I posted this on Stack Overflow with the suggestion being to raise an issue/question on the repo itself.

cosmicBboy commented 1 year ago

Hi @kr-hansen so as background, the SchemaModel classes inherit field attributes but not configs by design (@jeffzi might be able to provide more insight on the rationale) but the configuration settings are effectively inherited when you do Model2.to_schema(). Ultimately, SchemaModels are converted into DataFrameSchemas are used under the hood for before they do any runtime validation.

So for example:

import pandera as pa
from pandera.typing import Series

class Model1(pa.SchemaModel):
    col1: Series[int]

    class Config:
        unique=['col1']

class Model2(Model1):
    col2: Series[str]

print("Model1")
print(Model1.to_schema().to_yaml())
print("\nModel2")
print(Model2.to_schema().to_yaml())

output:

Model1
schema_type: dataframe
version: 0.0.0+dev0
columns:
  col1:
    title: null
    description: null
    dtype: int64
    nullable: false
    checks: null
    unique: false
    coerce: false
    required: true
    regex: false
checks: null
index: null
coerce: false
strict: false
unique:
- col1
ordered: false

Model2
schema_type: dataframe
version: 0.0.0+dev0
columns:
  col1:
    title: null
    description: null
    dtype: int64
    nullable: false
    checks: null
    unique: false
    coerce: false
    required: true
    regex: false
  col2:
    title: null
    description: null
    dtype: str
    nullable: false
    checks: null
    unique: false
    coerce: false
    required: true
    regex: false
checks: null
index: null
coerce: false
strict: false
unique:
- col1
ordered: false

As you can see the unique: ["col1"] attribute is passed down to Model2.

I can't remember the exact technical reasons why Configs aren't actually inherited (via Python class inheritance semantics) but I suspect that it was easier to implement overriding behavior such that Config.<attributes> defined in subclasses take precedence over those defined in parent classes.

Is there a particular reason you need this behavior? Maybe worth revisiting this.

kr-hansen commented 1 year ago

Oh interesting. Thanks for your clear explanation replicated with the example.

My primary use-case came up in that I've built essentially a data hierarchy for the models expected at different points in our processing pipeline. I have basic models, then models that inherit from multiple models (ie. class Model3(Model1, Model2)). I'd written a helper function to extract the uniqueness constraints for a given combination model that essentially would access Model3.Config.unique to return the columns that are specified as being unique in that context. My unit test for this function kept failing because if I built Model3 like the above, Model3.Config.unique returns a None since Config isn't explicitly defined in Model3. I see now that if I do Model3.to_schema().unique then it properly inherits ['col1'] as I was expecting.

However, it does seem that the .to_schema() method just inherits the Config from only the first class passed in. Does that seem correct? I based this on the following example tweaked from above:

from pandera.typing import Series
import pandera as pa

class Model1(pa.SchemaModel):
    col1: Series[int]
    class Config:
        unique=['col1']

class Model2(pa.SchemaModel):
    col2: Series[str]
    class Config:
        unique=['col2']

class Model3(Model1, Model2):
    pass

If I do Model3.to_schema().unique all I get back is ['col1']. I'm guessing this is probably expected for simplicity as far as how to best inherit a Config from other classes?

It also seems that it just picks up the first Config that is defined and ignores empty Configs? I'm assuming this from the following example tacked on to the above example:

class Model4(pa.SchemaModel):
    col4: Series[int]

class Model5(Model4, Model2, Model1):
    pass

If I do Model5.to_schema.unique here I get back ['col2']. I was expecting to have a blank Config here since Model4 was first defined. However it seems since Model4 doesn't have a Config defined, it skipped until it found a defined Config with Model2 and picked up that Config.

I mostly bring this up as a question to better understand what Pandera is doing and how it picks up different elements of SchemaModels from inheritance (Configs vs Fields). I don't know if this necessitates revisiting this, I mostly raised the question as it seems pandera behaves slightly differently than what my expectations were (it would inherit the union of all parent models unless explicitly overwritten).

For my specific use-case, what you outlined about using .to_schema() would probably work just fine. However, at some point for my project we may want to be combining models with different unique-ness constraints, so with that in mind I will probably be more explicit in these inheritance cases and I can combine unique elements from the Parent models explicitly in my code to avoid any confusion there like I showed in the original example (ie. Model3.Config = Model1.Config.unique + Model2.Config.unique). This was helpful to explore either way for me.

cosmicBboy commented 1 year ago

I think the behavior you're seeing can be explained with Python inheritance... my memory's a little rusty here, but I believe Python will inherit attributes based on the order in which they're specified in the class MyClass(...) definition, with parent classes earlier in the class definition taking precedence.

If I do Model3.to_schema().unique all I get back is ['col1']

If I'm correct in the above statement, this is expected... Python won't automagically merge the two Config definitions, pandera would have to implement that.

If I do Model5.to_schema.unique here I get back ['col2']. I was expecting to have a blank Config here since Model4

So since Model4 doesn't define a Config class defined inside, Python will use the one in Model2. Again, I believe this is how Python inheritence works.

For my specific use-case, what you outlined about using .to_schema() would probably work just fine

Cool, I'll close this issue for now, please feel free to open a new one describing the precise behavior you'd want, ideally with code examples as you've done in this thread.