unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

pa.Column(drop_invalid_rows=True) has no effect for Pandas DataFrames #1830

Open JohannHansing opened 1 week ago

JohannHansing commented 1 week ago

Describe the bug

This is my first bug report in an open source repo, so I apologize in advance if it's not done adequately.

The flag Column(drop_invalid_rows=True) has no effect when validating pandas dataframes. This can readily be observed in the documentation:

https://pandera.readthedocs.io/en/stable/drop_invalid_rows.html

In the example given there, with df = pd.DataFrame({"counter": ["1", "2", "3"]}), no row is actually dropped, even though there should be validation errors, since the values are of string type and not integer type.

During debugging, I found that the bug occurs due to the fact that in pandera/backends/pandas/container.py, the following if-clause on line 118 evaluates to False if drop_invalid_rows is set to True:

        if error_handler.collected_errors:
            if getattr(schema, "drop_invalid_rows", False):
                check_obj = self.drop_invalid_rows(check_obj, error_handler)
                return check_obj
            else:
                raise SchemaErrors(
                    schema=schema,
                    schema_errors=error_handler.schema_errors,
                    data=check_obj,
                )

This in turn seems to be caused by the fact that pandera internally communicates validation errors by collecting exceptions via try-except clauses. When drop_invalid_rows is set to True, no exceptions are raised, which is why bool(error_handler.collected_errors) evaluates to False.

If drop_invalid_rows were not set to True, the validation errors would have raised exceptions in ArraySchemaBackend.validate (pandera/backends/pandas/array.py), which in turn would have been collected by the try-except block in run_schema_component_checks (pandera/backends/pandas/container.py).
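To make the control flow described above easier to follow, here is a simplified, self-contained model of it. This is not pandera's actual code: the names ToySchemaError, validate_column and validate_container are hypothetical, and the logic is only my reading of the behavior, reduced to plain Python lists.

```python
class ToySchemaError(Exception):
    """Hypothetical stand-in for a validation error; carries failing row indices."""

    def __init__(self, failing_rows):
        super().__init__(f"rows failed: {failing_rows}")
        self.failing_rows = failing_rows


def validate_column(values, check, drop_invalid_rows):
    """Toy column validator: raises on failure unless the column-level flag is set."""
    failing = [i for i, v in enumerate(values) if not check(v)]
    if failing and not drop_invalid_rows:
        raise ToySchemaError(failing)
    # With the column-level flag set, nothing is raised, so the caller
    # never learns which rows failed.
    return values


def validate_container(values, check, column_drop, schema_drop):
    """Toy container validator: it learns about failures only via exceptions."""
    collected_errors = []
    try:
        validate_column(values, check, column_drop)
    except ToySchemaError as err:
        collected_errors.append(err)

    if collected_errors:
        if schema_drop:
            bad = {i for err in collected_errors for i in err.failing_rows}
            return [v for i, v in enumerate(values) if i not in bad]
        raise collected_errors[0]
    # Reached whenever no exception was collected, even if rows were invalid.
    return values


def short_enough(s):
    return len(s) <= 5


data = ["this string is too long", "fine"]

# Flag on the schema: the exception is collected and the row is dropped.
schema_level = validate_container(data, short_enough, column_drop=False, schema_drop=True)

# Flag on the column: no exception, so the data comes back unchanged.
column_level = validate_container(data, short_enough, column_drop=True, schema_drop=False)
```

In this toy model, schema_level contains only "fine", while column_level still contains both strings, which mirrors what I observe with pa.Column(drop_invalid_rows=True).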

To fix this bug, I would humbly suggest considering refactoring the code so that it does not communicate via try-except statements. Validation errors should be collected into e.g. lists and these lists passed between functions. Exceptions should only be raised if lazy=False and not be used to pass data between functions.
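A rough sketch of what I mean, in plain Python rather than against pandera's real classes. Everything here (Failure, ValidationFailed, check_column, validate, the drop_columns parameter) is hypothetical and only illustrates the idea of passing failure records around as lists and raising at a single place.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    """Hypothetical record describing one failed row in one column."""
    column: str
    row: int
    message: str


class ValidationFailed(Exception):
    """Hypothetical exception raised once, after failures have been gathered."""

    def __init__(self, failures):
        super().__init__(f"{len(failures)} validation failure(s)")
        self.failures = failures


def check_column(name, values, check, message):
    """Return a list of Failure records; never raises for invalid data."""
    return [Failure(name, i, message) for i, v in enumerate(values) if not check(v)]


def validate(table, checks, drop_columns=(), lazy=True):
    """Validate a dict of column -> list of values.

    checks maps column -> (predicate, message). Invalid rows of columns
    listed in drop_columns are dropped instead of being reported.
    """
    failures = []
    for name, (check, message) in checks.items():
        column_failures = check_column(name, table[name], check, message)
        if column_failures and not lazy and name not in drop_columns:
            # Eager mode: fail on the first column that has failures.
            raise ValidationFailed(column_failures)
        failures.extend(column_failures)

    droppable = {f.row for f in failures if f.column in drop_columns}
    remaining = [f for f in failures if f.column not in drop_columns]
    if remaining:
        raise ValidationFailed(remaining)

    n_rows = len(next(iter(table.values()), []))
    keep = [i for i in range(n_rows) if i not in droppable]
    return {name: [col[i] for i in keep] for name, col in table.items()}


table = {"c": ["this string is too long", "fine"]}
checks = {"c": (lambda s: len(s) <= 5, "str_length(max_value=5)")}

# The failure list is available regardless of the flag, so dropping works.
result = validate(table, checks, drop_columns={"c"})
```

Because the failures are returned as data, the decision to drop rows or to raise is made in one place and does not depend on whether an exception happened to be raised further down.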

Update: Further debugging on the main branch of pandera led me to realize that the bug does not occur for DataFrameSchema(drop_invalid_rows=True). This is why the unit tests for drop_invalid_rows=True are green in test_schemas, where drop_invalid_rows is passed as an argument to DataFrameSchema and not to Column.

Code Sample, a copy-pastable example

>>> import pandas as pd
>>> import pandera as pa
>>> schema_test = pa.DataFrameSchema({"c": pa.Column(str, drop_invalid_rows=True, checks=pa.Check.str_length(max_value=5))})
>>> df_result = schema_test.validate(pd.DataFrame({"c": ["this string is too long", "fine"]}), lazy=True)
>>> print(df_result)
                         c
0  this string is too long
1                     fine

Expected behavior

>>> print(df_result)
                         c
1                     fine

Desktop (please complete the following information):

Windows 10

JohannHansing commented 6 days ago

Another weird and perhaps related observation:

This produces a red unit test result, since the invalid rows are not dropped:

        (
            DataFrameSchema(
                {
                    "c": Column(int, checks=[Check(lambda x: x >= 3)]),
                },
                drop_invalid_rows=True,
            ),
            pd.DataFrame({"c": [1, 2, 3, 4, 5, 6],}),
            pd.DataFrame({"c": [3, 4, 5, 6]}),
        ),
    ],
)
def test_drop_invalid_for_dataframe_schema(schema, obj, expected_obj):

But the following doesn't, where I only replaced "c" with "numbers":

        (
            DataFrameSchema(
                {
                    "numbers": Column(int, checks=[Check(lambda x: x >= 3)]),
                },
                drop_invalid_rows=True,
            ),
            pd.DataFrame({"numbers": [1, 2, 3, 4, 5, 6],}),
            pd.DataFrame({"numbers": [3, 4, 5, 6]}),
        ),
    ],
)
def test_drop_invalid_for_dataframe_schema(schema, obj, expected_obj):