Open JohannHansing opened 1 week ago
Another weird and perhaps related observation:
This produces a red unit test result, since the invalid rows are not dropped:
        (
            DataFrameSchema(
                {
                    "c": Column(int, checks=[Check(lambda x: x >= 3)]),
                },
                drop_invalid_rows=True,
            ),
            pd.DataFrame({"c": [1, 2, 3, 4, 5, 6]}),
            pd.DataFrame({"c": [3, 4, 5, 6]}),
        ),
    ],
)
def test_drop_invalid_for_dataframe_schema(schema, obj, expected_obj):
But the following does not, even though I only replaced "c" with "numbers":
        (
            DataFrameSchema(
                {
                    "numbers": Column(int, checks=[Check(lambda x: x >= 3)]),
                },
                drop_invalid_rows=True,
            ),
            pd.DataFrame({"numbers": [1, 2, 3, 4, 5, 6]}),
            pd.DataFrame({"numbers": [3, 4, 5, 6]}),
        ),
    ],
)
def test_drop_invalid_for_dataframe_schema(schema, obj, expected_obj):
Describe the bug
This is my first bug report in an open source repo, so I apologize in advance if it's not done adequately.
The flag Column(drop_invalid_rows=True) has no effect when validating pandas dataframes. This can readily be observed in the documentation: https://pandera.readthedocs.io/en/stable/drop_invalid_rows.html
In the example given there, with df = pd.DataFrame({"counter": ["1", "2", "3"]}), no row is actually dropped, even though there should be validation errors, since the values are strings and not integers.

During debugging, I found that the bug occurs because, in pandera/backends/pandas/container.py, the if-clause on line 118 evaluates to False if drop_invalid_rows is set to True. This in turn seems to be caused by the fact that pandera internally communicates validation errors by collecting exceptions via try-except clauses, but when drop_invalid_rows is set to True, no exceptions are raised, which is why bool(error_handler.collected_errors) evaluates to False.

If drop_invalid_rows were not set to True, the validation errors would have raised exceptions in ArraySchemaBackend.validate in pandera/backends/pandas/array.py, and these would in turn have been collected in the try-except block in run_schema_component_checks in pandera/backends/pandas/container.py.

To fix this bug, I would humbly suggest considering refactoring the code so that it does not communicate via try-except statements. Validation errors should be collected into e.g. lists, and these lists should be passed between functions. Exceptions should only be raised if lazy=False and should not be used to pass data between functions.
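To make the suggestion more concrete, here is a rough sketch of the list-based approach in plain Python. It is not pandera code and does not use pandera's internals: the names ValidationFailure, ValidationErrors, check_column and validate are hypothetical, and rows are modelled as plain dicts instead of a DataFrame. The point is only that the row-dropping step can read the collected failures directly, so it no longer depends on whether an exception was raised further down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationFailure:
    """One failed check, recorded as data (hypothetical name, not pandera's)."""
    column: str
    row_index: int
    message: str


class ValidationErrors(Exception):
    """Raised only in eager mode (lazy=False); carries the failures found."""

    def __init__(self, failures):
        super().__init__(f"{len(failures)} validation failure(s)")
        self.failures = failures


def check_column(rows, column, predicate):
    """Return the failures for one column as a list; never raises for bad data."""
    failures = []
    for index, row in enumerate(rows):
        value = row[column]
        if not predicate(value):
            failures.append(
                ValidationFailure(column, index, f"check failed for {value!r}")
            )
    return failures


def validate(rows, checks, lazy=True, drop_invalid_rows=False):
    """Validate rows and return (rows, failures).

    With lazy=False an exception is raised at the first failing column.
    With lazy=True nothing is raised; the failures are returned as a list.
    """
    failures = []
    for column, predicate in checks.items():
        column_failures = check_column(rows, column, predicate)
        if column_failures and not lazy:
            raise ValidationErrors(column_failures)
        failures.extend(column_failures)

    if drop_invalid_rows:
        # The drop step reads the collected list, not a caught exception.
        invalid = {failure.row_index for failure in failures}
        rows = [row for index, row in enumerate(rows) if index not in invalid]
    return rows, failures


if __name__ == "__main__":
    data = [{"c": value} for value in [1, 2, 3, 4, 5, 6]]
    kept, found = validate(
        data, {"c": lambda x: x >= 3}, lazy=True, drop_invalid_rows=True
    )
    print([row["c"] for row in kept])  # [3, 4, 5, 6]
    print(len(found))  # 2
```

In this sketch the flag could sit at the column level or at the schema level without changing the outcome, because both would end up reading the same list of failures.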
Update: Further debugging on the main branch of pandera led me to realize that the bug does not occur for DataFrameSchema(drop_invalid_rows=True). This is why the unit tests for drop_invalid_rows=True are green in test_schemas, where drop_invalid_rows is passed as an argument to DataFrameSchema and not to Column.
Code Sample, a copy-pastable example
Expected behavior
Desktop (please complete the following information):
Windows 10