unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.27k stars, 305 forks

Setting strict to false in SchemaModel does not ignore unspecified columns in schema #361

Closed: dash-samuel closed this issue 3 years ago

dash-samuel commented 3 years ago

Describe the bug

Hi everyone, firstly thanks a lot for working on this library, it is indeed very useful!

I have discovered that validating a data frame that lacks a column declared in a SchemaModel still fails, even with strict set to False in the Config, with the error: pandera.errors.SchemaError: column 'x' not in dataframe.


Code Sample, a copy-pastable example

import pandas as pd
import pandera as pa
from pandera.typing import Index, DataFrame, Series

class InputSchema(pa.SchemaModel):
    a: Series[int] = pa.Field()
    b: Series[int] = pa.Field()

    class Config:
        name = "BaseSchema"
        strict = False

df = pd.DataFrame({
    "a": ["2001", "2002", "2003"],
})

InputSchema.validate(df)

Expected behavior

According to the documentation, I would expect that setting strict=False within a SchemaModel means that columns not specified in the schema are not checked. Or is this only available in the object-based API with required=False? If so, apologies in advance.


jeffzi commented 3 years ago

@dash-samuel Thanks for your feedback.

Actually, you need to use typing.Optional to express that a column is not required. By default all columns are required, just as in the regular pandera API. See https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#required

I realize that the documentation of SchemaModel does not mention how to make columns optional. @cosmicBboy I'll submit a fix for it.

from typing import Optional

import pandas as pd
import pandera as pa
from pandera.typing import Index, DataFrame, Series

class InputSchema(pa.SchemaModel):
    a: Series[int]
    b: Optional[Series[int]] 

    class Config:
        name = "BaseSchema"

# Integer values so that column "a" matches Series[int];
# column "b" is intentionally absent.
df = pd.DataFrame(
    {
        "a": [2001, 2002, 2003],
    }
)

InputSchema.validate(df)

Notes:

dash-samuel commented 3 years ago

@jeffzi thank you very much for clarifying this. It wasn't immediately clear to me when reading the documentation, so adding it there would definitely make things easier for users!

I am posting some example code again for the same use case with the applied fixes in the case where:

import pandas as pd
import pandera as pa
from typing import Optional
from pandera.typing import Index, DataFrame, Series, String

class InputSchema(pa.SchemaModel):
    a: Series[String]
    b: Optional[Series[int]]

    class Config:
        name = "BaseSchema"

df = pd.DataFrame({
    "a": ["2001", "2002", "2003"],
})

InputSchema.validate(df)