unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Enable raising warning if optional column is missing #706

Open benlindsay opened 2 years ago

benlindsay commented 2 years ago

There are times when it's not clear if a column is necessary or not, so I don't want to force it to exist, but I'd also like to be able to see a warning if that column isn't present to help debug long pipelines. I can do this with something like this:

from typing import Optional

import pandera as pa
import pandas as pd
from pandera.typing import Series

class UserSchema(pa.SchemaModel):
    id: Series[int]
    name: Series[str]
    warn_me_if_missing: Optional[Series[str]]

    @pa.dataframe_check
    def warn_if_missing_columns(cls, df: pd.DataFrame) -> bool:
        if "warn_me_if_missing" not in df.columns:
            print("WARNING: 'warn_me_if_missing' column not present")
        return True

input_df = pd.DataFrame({"id": [0, 1], "name": ["Bob", "Alice"]})
df = UserSchema.validate(input_df)
print(df)
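The workaround above reports the missing column with print, which cannot be filtered, captured, or turned into an error. A minimal sketch of the same test using the standard library's warnings module is below. The helper name warn_if_missing and its parameters are made up for illustration and are not part of pandera; it only needs an iterable of column names, so it runs without pandas installed.

```python
import warnings
from typing import Iterable, List


def warn_if_missing(present: Iterable[str], wanted: Iterable[str]) -> List[str]:
    """Warn once per wanted column that is absent and return the missing names.

    Hypothetical helper, not a pandera function.
    """
    present_set = set(present)
    # Keep the order of `wanted` so warnings come out in a predictable order.
    missing = [col for col in wanted if col not in present_set]
    for col in missing:
        # stacklevel=2 attributes the warning to the caller of this helper.
        warnings.warn(f"'{col}' column not present", UserWarning, stacklevel=2)
    return missing
```

Inside the dataframe check, the body could then become warn_if_missing(df.columns, ["warn_me_if_missing"]) followed by return True, so validation still passes while the message goes through the normal warning filters.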

But it would be nice to be able to simplify the schema with syntax like this:

class UserSchema(pa.SchemaModel):
    id: Series[int]
    name: Series[str]
    warn_me_if_missing: Optional[Series[str]] = pa.Field(warn_if_missing=True)

Additionally, it might be nice to have a schema-wide config option like this:

class UserSchema(pa.SchemaModel):
    id: Series[int]
    name: Series[str]
    warn_me_if_missing: Optional[Series[str]]

    class Config:
        warn_if_missing_columns = True

But I think being able to turn that on and off on a per-column basis would be nice, since there might be some optional columns you don't want a warning about and some you do. Open to whatever naming and syntax makes the most sense. Mostly just hoping to make raising warnings take a little less code and remain a little more DRY.

cosmicBboy commented 2 years ago

Hey, @benlindsay this proposal looks good to me!

One minor naming/API change I'd suggest is perhaps to do Field(on_missing="warn") and on_missing_columns="warn" for the schema model config. I think this is nice because actually now on_missing and on_missing_columns can be refactored to accept callback functions (some time in the future of course!).

For the scope of this issue "warn" would be the only available option, but given the direction the library is going, I'd like to add customizability via callbacks, and doing something when a column is missing is a perfect use case to start this pattern.
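To make the "string now, callback later" idea concrete, here is a rough sketch of how an on_missing value could be dispatched. This is plain Python and not pandera code: the function handle_missing, the OnMissing alias, and the callback signature (one column name in, nothing out) are all assumptions made for illustration; only the on_missing name and the "warn" option come from the suggestion above.

```python
import warnings
from typing import Callable, Iterable, List, Union

# Hypothetical type: either the string "warn" or a callback taking a column name.
OnMissing = Union[str, Callable[[str], None]]


def handle_missing(
    present: Iterable[str],
    optional: Iterable[str],
    on_missing: OnMissing = "warn",
) -> List[str]:
    """Apply `on_missing` to each optional column that is absent.

    Illustrative only; pandera does not expose this function.
    """
    # Reject unknown options up front, even if no column turns out to be missing.
    if not callable(on_missing) and on_missing != "warn":
        raise ValueError(f"unsupported on_missing option: {on_missing!r}")

    present_set = set(present)
    missing = [col for col in optional if col not in present_set]
    for col in missing:
        if callable(on_missing):
            on_missing(col)  # user-supplied behaviour
        else:
            warnings.warn(f"'{col}' column not present", UserWarning, stacklevel=2)
    return missing
```

With this shape, passing a list's append method as on_missing would collect the missing names instead of warning, which is the kind of customization a callback option would allow.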

Feel free to make a PR for this issue. A few things to keep in mind when building this out:

Let me know if you have any questions!

benlindsay commented 2 years ago

I like that API change suggestion, thanks for the feedback! I'd love to work on this, but don't foresee myself being able to commit the time to do so in the near future. I'll make a PR if that changes some day, but in the meantime if anyone else has the time and interest in doing this, please go for it.

Thanks!

cosmicBboy commented 2 years ago

cool! just added the "help wanted" tag on this issue.

In the meantime, here's a possibly useful extension to the code snippet you provided:

class UserSchema(pa.SchemaModel):

    id: Series[int]
    name: Series[str]
    warn_me_if_missing: Optional[Series[str]]

    # pandera ignores private class attributes when gathering column/index fields
    _warn_if_missing_columns = [
        "warn_me_if_missing",
        # ... (further column names go here)
    ]

    @pa.dataframe_check
    def warn_if_missing_columns(cls, df: pd.DataFrame) -> bool:
        # access the private class attribute here
        for col in cls._warn_if_missing_columns:
            if col not in df.columns:
                print(f"WARNING: '{col}' column not present")
        return True

benlindsay commented 2 years ago

Nice, thanks for the tag and the hints! Maybe I can add that dataframe check to my BaseSchema mentioned here, with _warn_if_missing_columns = [], which subclasses can override.
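The base-class idea can be sketched without pandera to show just the override mechanics. The class names below are invented for illustration, and the method takes an iterable of column names rather than a DataFrame so that the snippet runs standalone. In the setup described in this thread, the same body would presumably sit inside the @pa.dataframe_check method on the base schema and read df.columns.

```python
import warnings
from typing import ClassVar, Iterable, List


class BaseColumnWarnings:
    """Stand-in for a base schema; hypothetical, not a pandera class."""

    # Empty by default, so the base class never warns.
    _warn_if_missing_columns: ClassVar[List[str]] = []

    @classmethod
    def warn_if_missing_columns(cls, columns: Iterable[str]) -> bool:
        present = set(columns)
        # `cls` resolves to the subclass, so its overridden list is used.
        for col in cls._warn_if_missing_columns:
            if col not in present:
                warnings.warn(f"'{col}' column not present", UserWarning)
        # Always pass: a missing optional column is a warning, not a failure.
        return True


class UserColumnWarnings(BaseColumnWarnings):
    # Subclasses only declare which optional columns deserve a warning.
    _warn_if_missing_columns = ["warn_me_if_missing"]
```

Because the list is looked up through cls, each subclass gets its own behaviour without redefining the check, which keeps the per-schema code down to one attribute.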