unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Remove columns on a SchemaModel #987

Closed: a-recknagel closed this 1 year ago

a-recknagel commented 1 year ago

Problem

I use SchemaModels extensively to type-hint my function calls and have written some code which uses the annotations at run-time. At one point, I wanted to define a SchemaB which had dropped a bunch of columns from its parent SchemaA:

import pandera as pa
from pandera.typing import DataFrame, Series

class SchemaA(pa.SchemaModel):
    a: Series[int]
    b: Series[int]
    c: Series[int]
    d: Series[int]

class SchemaB(SchemaA):
    ...

cols_to_keep = ["a", "b", "c"]
def foo(data: DataFrame[SchemaA]) -> DataFrame[SchemaB]:
    return data[cols_to_keep]

Unless I'm a bit blind, there is no way to define SchemaB in a straightforward manner.

Possible Solutions

Alternatives

I've been trying to use DataFrameSchema, which supports removing columns, for SchemaB instead:

SchemaB = (df_schema_a := SchemaA.to_schema()).remove_columns(
    [col for col in df_schema_a.columns if col not in cols_to_keep]
)

But then typing doesn't work any more, because SchemaB is now a DataFrameSchema instance, not a type, so annotations like DataFrame[SchemaB] no longer work.

Additional Context

I can't invert the parent/child relationship of SchemaA and SchemaB, because the actual parents of SchemaA are a bunch of other SchemaModels, and the columns-to-keep are split among them. I'd like SchemaB to keep inheriting from SchemaA to avoid code duplication, since many of the columns have rather complex definitions.

cosmicBboy commented 1 year ago

hey @a-recknagel so unfortunately using SchemaModels in this way isn't possible because inheritance is only additive.

It's perhaps a little less intuitive, but if you want to rely on Python's inheritance semantics, I typically use this pattern:

import pandera as pa
from pandera.typing import DataFrame, Series

class BaseSchema(pa.SchemaModel):
    # put all the common columns here
    a: Series[int]
    b: Series[int]

class SchemaA(BaseSchema):
    c: Series[int]
    d: Series[int]

class SchemaB(BaseSchema):
    # suppose you drop but also add a bunch of columns. Relative to SchemaA,
    # this is equivalent to dropping "c" and "d" and adding "e" and "f"
    e: Series[int]
    f: Series[int]

This can get overly verbose, so an alternative would be to override the to_schema method in SchemaB, calling the parent class's to_schema() and then applying schema transformations to the result:

import pandera as pa
from pandera.typing import DataFrame, Series

class SchemaA(pa.SchemaModel):
    a: Series[int]
    b: Series[int]
    c: Series[int]
    d: Series[int]

class SchemaB(SchemaA):
    ...

    @classmethod
    def to_schema(cls) -> pa.DataFrameSchema:
        schema = super().to_schema()
        return schema.remove_columns(["a", "b"])

print(SchemaB.to_schema())

# <Schema DataFrameSchema(
#     columns={
#         'c': <Schema Column(name=c, type=DataType(int64))>
#         'd': <Schema Column(name=d, type=DataType(int64))>
#     },
#     checks=[],
#     coerce=False,
#     dtype=None,
#     index=None,
#     strict=False
#     name=SchemaB,
#     ordered=False,
#     unique_column_names=False
# )>

Note, though, that with this method the SchemaB class will still have a and b as class attributes, but pandera always uses the result of to_schema to actually perform the validation.

Lemme know if this works for you!

Also, this question has come up before a few times, so probably worth adding a section in the docs about this... would you be interested in contributing a section on this page?

a-recknagel commented 1 year ago

The second option hopefully works for me, thanks! I have some multiple inheritances, and always forget how to make overriding methods cooperative. I'll try to add the docs, too.
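
For reference, cooperative overriding with multiple inheritance can be sketched in plain Python. Nothing below is pandera: `Model`, `Left`, `Right`, `DropA`, `DropD` and `Combined` are hypothetical stand-ins, and `to_schema` here just returns a list of column names. The point is only that each override calls `super().to_schema()` and transforms the result, so every class in the MRO gets its turn.

```python
class Model:
    """Hypothetical stand-in for a schema base class (not pandera)."""

    columns: tuple = ()

    @classmethod
    def to_schema(cls) -> list:
        # Collect column names across the MRO, base classes first.
        seen = []
        for klass in reversed(cls.__mro__):
            for name in vars(klass).get("columns", ()):
                if name not in seen:
                    seen.append(name)
        return seen


class Left(Model):
    columns = ("a", "b")


class Right(Model):
    columns = ("c", "d")


class DropA(Model):
    @classmethod
    def to_schema(cls) -> list:
        # Cooperative: delegate to the next class in the MRO, then transform.
        return [c for c in super().to_schema() if c != "a"]


class DropD(Model):
    @classmethod
    def to_schema(cls) -> list:
        return [c for c in super().to_schema() if c != "d"]


class Combined(DropA, DropD, Right, Left):
    pass


print(Combined.to_schema())  # ['b', 'c']
```

Because neither override calls a specific parent by name, `DropA` and `DropD` can be mixed in any order without one silently skipping the other.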

cosmicBboy commented 1 year ago

Cool, I'm gonna convert this issue into a discussion... would you mind marking my response as the answer?