Open benlindsay opened 2 years ago
Hey @benlindsay, this proposal looks good to me!
One minor naming/API change I'd suggest is perhaps to do Field(on_missing="warn") and on_missing_columns="warn" for the schema model config. I think this is nice because on_missing and on_missing_columns can then be refactored to accept callback functions (some time in the future, of course!). For the scope of this issue, "warn" would be the only available option, but given the direction the library is going I'd like to add customizability via callbacks, and doing something when a column is missing is a perfect use case to start this pattern.
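To make the callback idea concrete, here is a minimal, library-independent sketch of how an on_missing option could accept either the string "warn" or a callable. Only the on_missing name and the "warn" value come from the suggestion above; handle_missing_column and OnMissing are hypothetical names and are not part of pandera.

```python
import warnings
from typing import Callable, Union

# Hypothetical type: either the literal "warn" or a user-supplied callback
# that receives the name of the missing column.
OnMissing = Union[str, Callable[[str], None]]


def handle_missing_column(column: str, on_missing: OnMissing) -> None:
    """Dispatch on the on_missing option for a single missing column."""
    if callable(on_missing):
        # Future direction: let the user decide what happens.
        on_missing(column)
    elif on_missing == "warn":
        # The only string option within the scope of this issue.
        warnings.warn(f"column '{column}' not present", UserWarning, stacklevel=2)
    else:
        raise ValueError(f"unsupported on_missing value: {on_missing!r}")
```

With this shape, adding callback support later would not change the public option name: handle_missing_column("age", "warn") emits a warning, while handle_missing_column("age", my_logger) would hand the column name to a user function.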
Feel free to make a PR for this issue. A few things to keep in mind when building this out:
- pandera.schemas.DataFrameSchema and pandera.models.SchemaModel would need to be updated in tandem to support the dataframe-level option.
- pandera.schema_components.Column and pandera.model_components.Field would similarly need to be updated in tandem.
Let me know if you have any questions!
I like that API change suggestion, thanks for the feedback! I'd love to work on this, but don't foresee myself being able to commit the time to do so in the near future. I'll make a PR if that changes some day, but in the meantime if anyone else has the time and interest in doing this, please go for it.
Thanks!
cool! just added the "help wanted" tag on this issue.
In the meantime, here's a perhaps useful extension to the code snippet you provided:
from typing import Optional

import pandas as pd
import pandera as pa
from pandera.typing import Series


class UserSchema(pa.SchemaModel):
    id: Series[int]
    name: Series[str]
    warn_me_if_missing: Optional[Series[str]]

    # pandera ignores private class attributes when gathering column/index fields
    _warn_if_missing_columns = [
        "warn_me_if_missing",
        # ...
    ]

    @pa.dataframe_check
    def warn_if_missing_columns(cls, df: pd.DataFrame) -> bool:
        # access the private class attribute here
        for col in cls._warn_if_missing_columns:
            if col not in df.columns:
                print(f"WARNING: '{col}' column not present")
        # always pass: this check only reports, it never fails validation
        return True
Nice, thanks for the tag and the hints! Maybe I can add that dataframe check to my BaseSchema mentioned here, with _warn_if_missing_columns = [] which can be overridden by subclasses.
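The override pattern can be sketched without pandera. In the sketch below, plain Python classes stand in for schema models, and find_missing is a hypothetical helper; in a real SchemaModel the same loop would live inside the dataframe check. It also uses warnings.warn rather than print, which is an assumption on my part, so that the warnings can be filtered or captured.

```python
import warnings
from typing import Iterable, List


class BaseSchema:
    # Empty by default; subclasses override this with the optional
    # columns they want flagged when absent.
    _warn_if_missing_columns: List[str] = []

    @classmethod
    def find_missing(cls, present: Iterable[str]) -> List[str]:
        """Warn for each watched column absent from `present` and return them."""
        present_set = set(present)
        missing = [c for c in cls._warn_if_missing_columns if c not in present_set]
        for col in missing:
            warnings.warn(f"'{col}' column not present", UserWarning, stacklevel=2)
        return missing


class UserSchema(BaseSchema):
    # Only the list changes; the checking logic is inherited.
    _warn_if_missing_columns = ["warn_me_if_missing"]
```

Because the lookup goes through cls, each subclass sees its own list, and the base class with its empty list never warns.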
There are times when it's not clear if a column is necessary or not, so I don't want to force it to exist, but I'd also like to be able to see a warning if that column isn't present, to help debug long pipelines. I can do this with something like this:
But it would be nice to be able to simplify the schema with syntax like this:
Additionally, it might be nice to have a schema-wide config option like this:
But I think being able to turn that on and off on a per-column basis would be nice, since there might be some optional columns you don't want a warning about and some you do. Open to whatever naming and syntax makes the most sense. Mostly just hoping to make raising warnings take a little less code and remain a little more DRY.
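One way a schema-wide setting and per-column overrides could interact is sketched below in plain Python. This is only an illustration of the resolution rule (a per-column value wins, otherwise the schema-wide default applies). ColumnSpec, warn_if_missing, and columns_to_warn_about are hypothetical names, not pandera API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class ColumnSpec:
    # Hypothetical stand-in for an optional column definition.
    # None means "inherit the schema-wide setting".
    warn_if_missing: Optional[bool] = None


def columns_to_warn_about(
    columns: Dict[str, ColumnSpec],
    present: Iterable[str],
    schema_default: bool = False,
) -> List[str]:
    """Return the absent columns for which a warning should be raised."""
    present_set = set(present)
    result = []
    for name, spec in columns.items():
        if name in present_set:
            continue
        # Per-column setting overrides the schema-wide default.
        if spec.warn_if_missing is None:
            effective = schema_default
        else:
            effective = spec.warn_if_missing
        if effective:
            result.append(name)
    return result
```

Under this rule, turning the schema-wide option on still lets an individual column opt out with warn_if_missing=False, and leaving it off still lets a single column opt in.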