Closed kr-hansen closed 1 year ago
Hi @kr-hansen, so as background, the SchemaModel classes inherit field attributes but not configs by design (@jeffzi might be able to provide more insight on the rationale), but the configuration settings are effectively inherited when you do Model2.to_schema(). Ultimately, SchemaModels are converted into DataFrameSchemas under the hood before they do any runtime validation.
So for example:
import pandera as pa
from pandera.typing import Series

class Model1(pa.SchemaModel):
    col1: Series[int]

    class Config:
        unique = ['col1']

class Model2(Model1):
    col2: Series[str]

print("Model1")
print(Model1.to_schema().to_yaml())
print("\nModel2")
print(Model2.to_schema().to_yaml())
output:
Model1
schema_type: dataframe
version: 0.0.0+dev0
columns:
  col1:
    title: null
    description: null
    dtype: int64
    nullable: false
    checks: null
    unique: false
    coerce: false
    required: true
    regex: false
checks: null
index: null
coerce: false
strict: false
unique:
- col1
ordered: false

Model2
schema_type: dataframe
version: 0.0.0+dev0
columns:
  col1:
    title: null
    description: null
    dtype: int64
    nullable: false
    checks: null
    unique: false
    coerce: false
    required: true
    regex: false
  col2:
    title: null
    description: null
    dtype: str
    nullable: false
    checks: null
    unique: false
    coerce: false
    required: true
    regex: false
checks: null
index: null
coerce: false
strict: false
unique:
- col1
ordered: false
As you can see, the unique: ["col1"] attribute is passed down to Model2.
I can't remember the exact technical reasons why Configs aren't actually inherited (via Python class inheritance semantics), but I suspect that it was easier to implement overriding behavior such that Config.<attributes> defined in subclasses take precedence over those defined in parent classes.
Is there a particular reason you need this behavior? Maybe worth revisiting this.
Oh interesting. Thanks for your clear explanation replicated with the example.
My primary use-case came up because I've built what is essentially a data hierarchy for the models expected at different points in our processing pipeline. I have basic models, then models that inherit from multiple models (i.e. class Model3(Model1, Model2)). I'd written a helper function to extract the uniqueness constraints for a given combination model, which essentially accesses Model3.Config.unique to return the columns that are specified as being unique in that context. My unit test for this function kept failing because if I built Model3 like the above, Model3.Config.unique returns None, since Config isn't explicitly defined in Model3. I see now that if I do Model3.to_schema().unique then it properly inherits ['col1'] as I was expecting.
However, it does seem that the .to_schema() method inherits the Config from only the first class passed in. Does that seem correct? I based this on the following example tweaked from above:
from pandera.typing import Series
import pandera as pa

class Model1(pa.SchemaModel):
    col1: Series[int]

    class Config:
        unique = ['col1']

class Model2(pa.SchemaModel):
    col2: Series[str]

    class Config:
        unique = ['col2']

class Model3(Model1, Model2):
    pass
If I do Model3.to_schema().unique all I get back is ['col1']. I'm guessing this is probably expected for simplicity as far as how to best inherit a Config from other classes?
It also seems that it just picks up the first Config that is defined and ignores empty Configs? I'm assuming this from the following example tacked on to the above example:
class Model4(pa.SchemaModel):
    col4: Series[int]

class Model5(Model4, Model2, Model1):
    pass
If I do Model5.to_schema().unique here I get back ['col2']. I was expecting to have a blank Config here since Model4 was listed first. However, it seems that since Model4 doesn't have a Config defined, it skipped ahead until it found a defined Config in Model2 and picked up that Config.
I mostly bring this up as a question to better understand what pandera is doing and how it picks up different elements of SchemaModels from inheritance (Configs vs Fields). I don't know if this necessitates revisiting the design. I mostly raised the question because pandera behaves slightly differently from what I expected (that it would inherit the union of all parent models unless explicitly overwritten).
For my specific use-case, what you outlined about using .to_schema() would probably work just fine. However, at some point in my project we may want to combine models with different uniqueness constraints, so with that in mind I will probably be more explicit in these inheritance cases and combine unique elements from the parent models explicitly in my code to avoid any confusion there, like I showed in the original example (i.e. Model3.Config = Model1.Config.unique + Model2.Config.unique). This was helpful to explore either way for me.
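The explicit combination described above could be sketched in plain Python roughly as follows. This is only an illustration under stated assumptions, not pandera's API: merged_unique is a hypothetical helper, the classes are ordinary Python classes standing in for schema models so the example runs without pandera, and unique is assumed to always be a list. The helper looks only at a Config declared directly in each class body and collects the unique columns in method resolution order, skipping duplicates.

```python
def merged_unique(model):
    """Union of Config.unique over a class and its parents (hypothetical helper)."""
    merged = []
    for klass in model.__mro__:
        # vars() exposes only what this class body declared, not inherited attributes.
        config = vars(klass).get("Config")
        if config is None:
            continue
        for column in getattr(config, "unique", None) or []:
            if column not in merged:
                merged.append(column)
    return merged


# Plain classes standing in for schema models (assumption: no pandera involved).
class Model1:
    class Config:
        unique = ["col1"]


class Model2:
    class Config:
        unique = ["col2"]


class Model3(Model1, Model2):
    pass


print(merged_unique(Model3))  # ['col1', 'col2']
```

With real pandera models, whether vars(klass) still holds the user-declared Config depends on how pandera's metaclass manages that attribute, so this would need checking before relying on it.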
I think the behavior you're seeing can be explained with Python inheritance... my memory's a little rusty here, but I believe Python will inherit attributes based on the order in which the parent classes are specified in the class MyClass(...) definition, with parent classes earlier in the class definition taking precedence.
If I do Model3.to_schema().unique all I get back is ['col1']
If I'm correct in the above statement, this is expected... Python won't automagically merge the two Config definitions, pandera would have to implement that.
If I do Model5.to_schema().unique here I get back ['col2']. I was expecting to have a blank Config here since Model4
So since Model4 doesn't define a Config class inside it, Python will use the one in Model2. Again, I believe this is how Python inheritance works.
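The lookup order described here can be checked with plain Python classes. This is a minimal sketch that deliberately leaves pandera out: the classes below are ordinary classes reusing the names from this thread, so it shows only Python's attribute lookup along the method resolution order (MRO), which matches what to_schema() reported above. It does not show what Model.Config returns on a real SchemaModel.

```python
class Model1:
    class Config:
        unique = ["col1"]


class Model2:
    class Config:
        unique = ["col2"]


class Model4:
    pass  # no Config declared here


class Model3(Model1, Model2):
    pass


class Model5(Model4, Model2, Model1):
    pass


# Attribute lookup walks the MRO and stops at the first class that defines the name.
print([klass.__name__ for klass in Model5.__mro__])
# ['Model5', 'Model4', 'Model2', 'Model1', 'object']

print(Model3.Config.unique)            # ['col1'], Model1 comes first
print(Model5.Config.unique)            # ['col2'], Model4 has no Config, so Model2 wins
print(Model5.Config is Model2.Config)  # True, nothing is merged
```

Nothing in this lookup combines the two Config classes; the first match is returned unchanged.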
For my specific use-case, what you outlined about using .to_schema() would probably work just fine
Cool, I'll close this issue for now. Please feel free to open a new one describing the precise behavior you'd want, ideally with code examples as you've done in this thread.
It seems that Configs on SchemaModels are not inherited. Is this by design?
I've built some pandera schema models that inherit from one another, but it seems that pandera SchemaModels don't inherit the Config from one another. Is this by design or am I doing something wrong?
For example:
With the above SchemaModels defined, I would expect that Model2 would also have an attribute of Model2.Config.unique that was equal to ['col1'] but that doesn't seem to get inherited. For every inherited subclass is it expected to re-define the Config and re-inherit what is defined in the parent classes?
My current workaround, which works OK, is
However, I wasn't sure whether there is additional metadata initialized in a blank Config that I'm overwriting by doing this, or whether there is a better practice for inheriting if there is a design reason not to have the config inherited by subsequent schemas. This pattern just went against my assumptions in building out mixtures of pandera schemas for my data.
Note that I posted this on Stack Overflow with the suggestion being to raise an issue/question on the repo itself.