Setting strict to false in SchemaModel does not ignore unspecified columns in schema

dash-samuel commented 3 years ago

Describe the bug

Hi everyone, firstly, thanks a lot for working on this library. It is very useful!

I have discovered that validating a data frame that lacks a column declared in a SchemaModel still fails, even with strict set to False in the Config. The error is: pandera.errors.SchemaError: column 'x' not in dataframe.

[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandas.
[ ] (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import pandera as pa
from pandera.typing import Index, DataFrame, Series

class InputSchema(pa.SchemaModel):
    a: Series[int] = pa.Field()
    b: Series[int] = pa.Field()

    class Config:
        name = "BaseSchema"
        strict = False

df = pd.DataFrame({
    "a": ["2001", "2002", "2003"],
})

InputSchema.validate(df)

Expected behavior

According to the documentation, I would expect that setting strict=False in a SchemaModel means that columns not specified in the schema are not checked. Or is this only available in the object-based API with required=False? If so, apologies in advance.

Desktop (please complete the following information):

OS: Ubuntu 20

jeffzi commented 3 years ago

@dash-samuel Thanks for your feedback.

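Actually, you need to use typing.Optional to express that a column is not required. By default all columns are required, as in the regular pandera API. The strict option only controls whether the data frame may contain columns that are not declared in the schema; it does not make declared columns optional. See https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#required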

I realize that the documentation of SchemaModel does not mention how to make columns optional. @cosmicBboy I'll submit a fix for it.

from typing import Optional

import pandas as pd
import pandera as pa
from pandera.typing import Index, DataFrame, Series

class InputSchema(pa.SchemaModel):
    a: Series[int]
    b: Optional[Series[int]]

    class Config:
        name = "BaseSchema"

df = pd.DataFrame(
    {
        "a": ["2001", "2002", "2003"],
    }
)

InputSchema.validate(df)

Notes:

Validation still fails because the column a contains strings, not integers.
You can omit pa.Field() if you don't need extra options or checks.
strict is False by default.
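
To make the distinction between the two settings concrete, here is a minimal pure-Python sketch of the two column-presence rules discussed above. It is a toy model only, not pandera's implementation: the function name column_errors, its parameters, and the wording of the second error message are hypothetical and chosen for illustration.

```python
from typing import Dict, Iterable, List


def column_errors(
    df_columns: Iterable[str],
    schema_required: Dict[str, bool],
    strict: bool = False,
) -> List[str]:
    """Toy model of the two column-presence rules (NOT pandera's real code).

    schema_required maps each schema column name to whether it is required.
    """
    present = list(df_columns)
    errors = []
    # Rule 1 ("required"): a schema column that is missing from the data frame
    # is an error unless it was declared optional.
    for name, required in schema_required.items():
        if required and name not in present:
            errors.append(f"column '{name}' not in dataframe")
    # Rule 2 ("strict"): a data frame column that is not declared in the schema
    # is an error only when strict is True.
    if strict:
        for name in present:
            if name not in schema_required:
                errors.append(f"column '{name}' not in schema")
    return errors


# The schema from the original report: a and b both required, strict=False.
print(column_errors(["a"], {"a": True, "b": True}, strict=False))
# Declaring b as optional removes the error for the missing column.
print(column_errors(["a"], {"a": True, "b": False}, strict=False))
# strict only matters for extra data frame columns such as c.
print(column_errors(["a", "c"], {"a": True, "b": False}, strict=True))
```

In this model the original report corresponds to the first call: b is required and absent, so the error is raised regardless of strict. Only the second rule looks at strict, and it looks in the opposite direction (data frame columns that the schema does not know about).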

dash-samuel commented 3 years ago

@jeffzi Thank you very much for clarifying this. It wasn't immediately clear to me from the documentation, so adding it there would definitely make things easier for users!

Here is the example code again for the same use case, with the fixes applied, for the case where:

a and b are part of the schema.
b is optional.
There are no further checks on the columns.

import pandas as pd
import pandera as pa
from typing import Optional
from pandera.typing import Index, DataFrame, Series, String

class InputSchema(pa.SchemaModel):
    a: Series[String]
    b: Optional[Series[int]]

    class Config:
        name = "BaseSchema"

df = pd.DataFrame({
    "a": ["2001", "2002", "2003"],
})

InputSchema.validate(df)
