unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.37k stars, 310 forks

Best way to check if one of the columns present or not #749

Closed: adityaguru149 closed this issue 2 years ago

adityaguru149 commented 2 years ago

Question about pandera

Use case: the user supplies a group-by column, and the code needs to group by that column (at least one column is user-supplied; the rest can be considered fixed) and then aggregate on another column.
Example: group by id1 and either id2 or id3, then aggregate on data.

This is very similar to this issue in pydantic.

At present, I have implemented it as follows (see atleast_one_from_column_options_present_check):

from typing import Optional
import pandas as pd
import pandera as pa

class DFSchema(pa.SchemaModel):
    id1: pa.typing.Series[str]
    id2: Optional[pa.typing.Series[str]]
    id3: Optional[pa.typing.Series[str]]
    data: pa.typing.Series[float]

    @pa.dataframe_check
    def atleast_one_from_column_options_present_check(cls, df: pd.DataFrame) -> bool:
        column_options = {"id1", "id2"}
        columns_found = column_options.intersection(df.columns)
        return len(columns_found) > 0

# id1 and id3 are present, id2 is missing
df = pd.DataFrame({"id1": ["a", "b"],
                   "id3": ["c", "d"],
                   "data": [1.6, 2.5]})
DFSchema.validate(df, lazy=True)
print(df.head())

# id1 is present, but neither id2 nor id3 (only an unrelated id30 column)
df = pd.DataFrame({"id1": ["a", "b"],
                   "id30": ["c", "d"],
                   "data": [1.6, 2.5]})
DFSchema.validate(df, lazy=True)
print(df.head())

Is there a better way to do this, for example with built-in pandera checks or decorators?
How do I show the column_options (none of which are present) in the error message?

Could this be taken up as a feature request, i.e. a generic decorator function that can be used on schemas or schema models?

cosmicBboy commented 2 years ago

Hi @adityaguru149, good question!

> How do I show the column_options (none of which is present) in Error Message?

You can set _column_options as a private attribute, which SchemaModel ignores, so you can store arbitrary metadata there.

class DFSchema(pa.SchemaModel):
    id1: pa.typing.Series[str]
    id2: Optional[pa.typing.Series[str]]
    id3: Optional[pa.typing.Series[str]]
    data: pa.typing.Series[float]

    # private attributes can contain arbitrary metadata
    _column_options = {"id1", "id2"}

    @pa.dataframe_check(
        # error keyword arg gives you custom error messages
        error=f"does not contain at least one of {_column_options}"
    )
    def atleast_one_from_column_options_present_check(cls, df: pd.DataFrame) -> bool:
        columns_found = cls._column_options.intersection(df.columns)
        return len(columns_found) > 0

The error summary looks like this:

Error Counts
------------
- column_not_in_dataframe: 1
- dataframe_check: 1

Schema Error Summary
--------------------
                                                                       failure_cases  n_failure_cases
schema_context  column check
DataFrameSchema <NA>   column_in_dataframe                                     [id1]                1
                       does not contain at least one of {'id2', 'id1'}       [False]                1
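
For the "generic" variant asked about in the question, one possible shape is a small factory that returns both the predicate and a matching error message. This is only a sketch, not a pandera API: `make_any_column_check` is a hypothetical name, and the options `{"id2", "id3"}` are an assumption taken from the use case described in the question (the posted code uses `{"id1", "id2"}`). The helper works on plain column names, so it has no pandera or pandas dependency of its own.

```python
from typing import Callable, Iterable, Set, Tuple


def make_any_column_check(
    options: Iterable[str],
) -> Tuple[Callable[[Iterable[str]], bool], str]:
    """Build a predicate and an error message for 'at least one of these columns'.

    Hypothetical helper for illustration; not part of pandera.
    """
    wanted: Set[str] = set(options)
    if not wanted:
        raise ValueError("options must contain at least one column name")

    def check(columns: Iterable[str]) -> bool:
        # True when any of the wanted columns appears among the given names.
        return not wanted.isdisjoint(columns)

    # sorted() keeps the message stable; a bare set prints in arbitrary order.
    error = f"does not contain at least one of {sorted(wanted)}"
    return check, error


# Column names below mirror the two example frames from the question.
check, error = make_any_column_check({"id2", "id3"})
print(check(["id1", "id3", "data"]))   # id3 is present
print(check(["id1", "id30", "data"]))  # neither id2 nor id3 is present
print(error)
```

Inside a schema model, the predicate could presumably be called with `df.columns` from a `@pa.dataframe_check` method, with the message passed as the `error` keyword argument, in the same way as the example above.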