unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Custom checks lost after to_yaml #929

Open hebrd opened 2 years ago

hebrd commented 2 years ago

Describe the bug

Custom checks defined on a schema are missing from the output of to_yaml().

Code Sample, a copy-pastable example

import pandera as pa

def low_lt_high(df):
    # Custom dataframe-level check: each row must satisfy low <= high.
    return df['low'] <= df['high']

schema = pa.DataFrameSchema(
    columns={"close": pa.Column(float, checks=[pa.Check.gt(0.0)])},
    checks=[pa.Check(low_lt_high)],
)
# The custom low_lt_high check does not appear in the printed YAML.
print(schema.to_yaml())

Expected behavior

Custom check rules should be kept in the YAML output so that they are restored when the schema is loaded again.


kylejcaron commented 1 month ago

Are there any plans for this? I'm also running into this problem. I have modular schemas that inherit from each other, and I want to give third-party users an easy way to get the YAML for a schema and see everything in one place. It would be a massive convenience and would help adoption.

Here's a simpler example without inheritance:

import pandera as pa
import pandera.extensions as extensions

@extensions.register_check_method(statistics=["cls"])
def non_null_values_in_extra_columns(df, cls):
    """Check that every column not specified in the schema contains no null values."""
    # Get the columns defined in the schema
    defined_columns = cls.to_schema().columns.keys()

    # Find columns in the DataFrame that are not defined in the schema
    extra_columns = [col for col in df.columns if col not in defined_columns]

    # Check that all values in these extra columns are not null
    return df[extra_columns].notnull().all().all()

class TestSchema(pa.DataFrameModel):

    @pa.dataframe_check
    def check_non_null_values_in_extra_columns(cls, df):
        res = pa.Check.non_null_values_in_extra_columns(cls)(df)
        return res.check_passed

print(TestSchema.to_yaml())

The expected behavior is for the registered method non_null_values_in_extra_columns to show up in the YAML output. If I include the check in the Config, it does show up in the YAML output, but then there is no way to reference cls. With inheritance, I would also have to restate all of the inherited Config settings, or else they would get overwritten (as far as I can tell, there's no way to append to a Config apart from the metadata).

kylejcaron commented 1 month ago

@cosmicBboy I hope you don't mind me tagging you, but I was wondering whether you had any feedback or thoughts on this issue?