unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

@pa.dataframe_check fails to identify issues in rows with NAs #516

Closed tfwillems closed 3 years ago

tfwillems commented 3 years ago

Describe the bug When using the @pa.dataframe_check decorator on a custom SchemaModel, validation is not correctly performed on rows if any of their columns contain NAs/None. These columns need not be defined in the SchemaModel and need not be referenced within the decorated validation function.


Code Sample, a copy-pastable example

import pandera as pa
import pandas as pd

class TestSchema(pa.SchemaModel):
    x: pa.typing.Series[int] = pa.Field(nullable=False)

    @pa.dataframe_check
    def fail_all(cls, df: pd.DataFrame) -> pa.typing.Series[bool]:
        # Use a dataframe validator that should fail every row to demonstrate the issue
        return df["x"].map(lambda x: False)

    class Config:
        strict = False # Allow extra columns

df1 = pd.DataFrame({"x": [1, 2, 3], "y": [None, None, None]})
TestSchema.validate(df1)
# Passes validation even though each row should fail. Likely occurs b/c y=None for each row

df2 = pd.DataFrame({"x": [1, 2, 3], "y": [None, 0, None]})
TestSchema.validate(df2)
# Only flags row at index 1 as failing. Other rows with y = None are not flagged
# failure cases:
#   column  index  failure_case
# 0      x      1           2.0
# 1      y      1           0.0

df3 = pd.DataFrame({"x": [1, 2, 3], "y": [0, 0, 0]})
TestSchema.validate(df3)
# Correctly flags all 3 rows as failing b/c y is no longer None/NA
# failure cases:
#   column  index  failure_case
# 0      x      0             1
# 1      x      1             2
# 2      x      2             3
# 3      y      0             0
# 4      y      1             0
# 5      y      2             0

Expected behavior

In the example above, I expected df1, df2, and df3 to result in the same validation failure cases, with all 3 rows flagged as errors in each case. Column y should have no bearing on the validation results, as it's not referenced within fail_all or the schema.

With the current behavior, @pa.dataframe_check validators appear to silently skip every row that has an NA in any column.

Screenshots

NA

Additional context

I suspect this is somehow related to a pandas.groupby, in which dropna defaults to True, so any rows with an NA in the grouping column are silently dropped.
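The suspected pandas behavior can be seen in isolation (a minimal sketch, independent of pandera; the column names are mine):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", None, "a", None], "val": [1, 2, 3, 4]})

# Default dropna=True: rows whose group key is NA are silently dropped,
# so only the "a" group survives
assert df.groupby("key")["val"].sum().to_dict() == {"a": 4}

# dropna=False (pandas >= 1.1) keeps the NA group as well
kept = df.groupby("key", dropna=False)["val"].sum()
assert len(kept) == 2
```

If pandera's check machinery groups or masks rows this way, it would explain why the NA-containing rows never reach the validator.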

tfwillems commented 3 years ago

Looks like I missed the mark here ... adding ignore_na=False to @pa.dataframe_check solved all of my issues. (ignore_na defaults to True, which is why rows with NAs were being skipped.)