Describe the bug
When using @pa.dataframe_check for a custom SchemaModel, validation is not correctly performed on rows if any of their columns contain NAs/None. These columns need not be defined in the SchemaModel and need not be referenced within the decorated validation function.
[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandera.
[ ] (optional) I have confirmed this bug exists on the master branch of pandera.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
import pandera as pa
import pandas as pd


class TestSchema(pa.SchemaModel):
    x: pa.typing.Series[int] = pa.Field(nullable=False)

    @pa.dataframe_check
    def fail_all(cls, df: pd.DataFrame) -> pa.typing.Series[bool]:
        # Use a dataframe validator that should fail every row to demonstrate the issue
        return df["x"].map(lambda x: False)

    class Config:
        strict = False  # Allow extra columns
df1 = pd.DataFrame({"x": [1, 2, 3], "y": [None, None, None]})
TestSchema.validate(df1)
# Passes validation even though every row should fail. Likely occurs because y is None for each row
df2 = pd.DataFrame({"x": [1, 2, 3], "y": [None, 0, None]})
TestSchema.validate(df2)
# Only flags the row at index 1 as failing. The other rows, where y is None, are not flagged
# failure cases:
# column index failure_case
# 0 x 1 2.0
# 1 y 1 0.0
df3 = pd.DataFrame({"x": [1, 2, 3], "y": [0, 0, 0]})
TestSchema.validate(df3)
# Correctly flags all 3 rows as failing because y is no longer None/NA
# failure cases:
#   column index failure_case
# 0      x     0            1
# 1      x     1            2
# 2      x     2            3
# 3      y     0            0
# 4      y     1            0
# 5      y     2            0
Expected behavior
In the example above, I expected df1, df2, and df3 to produce the same validation failure cases, with all 3 rows resulting in errors. Column y should have no bearing on the validation results, as it is not referenced within fail_all or the schema.
With the current behavior, @dataframe_check validators appear to silently skip every row that has an NA in any column.
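Until the root cause is fixed, one possible workaround is to drop columns the schema does not declare before calling validate, so an unrelated NA-bearing column cannot mask check failures. This is only a sketch: `SCHEMA_COLUMNS` and `subset_for_validation` are names invented here for illustration, not part of pandera's API.

```python
import pandas as pd

# Columns declared on TestSchema (listed manually for this sketch).
SCHEMA_COLUMNS = ["x"]


def subset_for_validation(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only columns the schema declares before validating."""
    return df[[c for c in SCHEMA_COLUMNS if c in df.columns]]


df1 = pd.DataFrame({"x": [1, 2, 3], "y": [None, None, None]})
subset = subset_for_validation(df1)
# subset contains only "x", so TestSchema.validate(subset) would no longer
# see the all-NA "y" column that triggers the bug.
```

Note this only helps when the extra columns are genuinely irrelevant to every check; with strict = False the dropped columns would otherwise pass through untouched.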
Desktop (please complete the following information):
OS: macOS Mojave
Browser: Chrome
Screenshots
NA
Additional context
I suspect this is somehow related to a pandas.groupby call, in which dropna defaults to True, so any row with an NA in the grouping column is silently dropped.
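The suspected groupby behavior can be reproduced in isolation with plain pandas. This is only a demonstration of the hypothesis, not a confirmed trace through pandera internals:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [None, 0, None]})

# With the default dropna=True, rows whose group key is NA vanish entirely:
# only the single row where y == 0 is counted.
counted_default = int(df.groupby("y").size().sum())

# With dropna=False (available since pandas 1.1), the NA group is kept,
# so all three rows are counted.
counted_keep_na = int(df.groupby("y", dropna=False).size().sum())
```

If pandera's dataframe-check machinery groups on a column containing NAs without passing dropna=False, it would explain why exactly the NA-bearing rows escape validation.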