
Nullable int not correctly recognized from pandas dataframe with NaN #1161

Open phillies opened 1 year ago

phillies commented 1 year ago

Describe the bug

A model with a nullable int field raises a SchemaError when a pandas DataFrame containing NaN entries is coerced against it. NaN should be accepted as null.

Code Sample, a copy-pastable example

from typing import Optional

import pandas as pd
from sqlmodel import SQLModel
from pandera import SchemaModel
from pandera.engines.pandas_engine import PydanticModel
from pandera.typing import DataFrame

class DemoModel(SQLModel):
    name: str
    count: Optional[int]  # nullable int field

class PanderaModel(SchemaModel):
    class Config:
        dtype = PydanticModel(DemoModel)  # validate each row via pydantic
        coerce = True

df = pd.DataFrame([{"name": "a", "count": 1}, {"name": "b", "count": None}])
print(df)

DataFrame[PanderaModel](df)  # raises SchemaError

Result:

SchemaError: Error while coercing 'PanderaModel' to type : Could not coerce  data_container into type 
   index    failure_case
0      1  {'count': nan}

Expected behavior

The DataFrame prints as:

  name  count
0    a    1.0
1    b    NaN

and coercion should pass, with the NaN treated as null.


phillies commented 1 year ago

Short amendment: switching int to float makes the coercion pass. Switching the column type from float to pandas.Int64Dtype, which supports both NA and ints, leads to the same error.
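For reference, a minimal sketch of the float variant described above (the *Float names are mine, not from the original report):

from typing import Optional

import pandas as pd
from sqlmodel import SQLModel
from pandera import SchemaModel
from pandera.engines.pandas_engine import PydanticModel
from pandera.typing import DataFrame

class DemoModelFloat(SQLModel):
    name: str
    count: Optional[float]  # float instead of int

class PanderaModelFloat(SchemaModel):
    class Config:
        dtype = PydanticModel(DemoModelFloat)
        coerce = True

df = pd.DataFrame([{"name": "a", "count": 1}, {"name": "b", "count": None}])
DataFrame[PanderaModelFloat](df)  # passes: NaN is a valid Optional[float]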

sanzoghenzo commented 1 year ago

Hi there, I just encountered the same issue using DataFrameSchema.

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "name": pa.Column(str),
        "count": pa.Column(int, nullable=True),
    },
    coerce=True
)

df = pd.DataFrame([{"name":"a", "count":1}, {"name":"b", "count":None}])

schema.validate(df, lazy=True)

results in

pandera.errors.SchemaErrors: Schema None: A total of 2 schema errors were found.
Error Counts
------------
- SchemaErrorReason.SCHEMA_COMPONENT_CHECK: 2
Schema Error Summary
--------------------
                                            failure_cases  n_failure_cases
schema_context column check                                               
Column         count  coerce_dtype('int64')         [nan]                1
                      dtype('int64')            [float64]                1

EDIT: nevermind, using pd.Int64Dtype as column type works for DataFrameSchemas. Sorry for the noise.
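For anyone landing here, a sketch of the working DataFrameSchema variant (same schema as above, with only the count dtype changed):

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "name": pa.Column(str),
        "count": pa.Column(pd.Int64Dtype(), nullable=True),  # nullable integer dtype
    },
    coerce=True,
)

df = pd.DataFrame([{"name": "a", "count": 1}, {"name": "b", "count": None}])
schema.validate(df, lazy=True)  # passes; count is coerced to Int64 with <NA>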

arnoin commented 1 year ago

I have looked into this as well, and it seems pandera tries to coerce each row using the pydantic model.

Similar to this, in your case:

class DemoModel(SQLModel):
    name: str
    count: Optional[int]

df = pd.DataFrame([{"name": "a", "count": 1}, {"name": "b", "count": None}])
DemoModel(**df.loc[1])  # raises ValidationError: count is float('nan')

and since the value is float('nan'), validation fails because NaN is neither an int nor None. An ugly workaround would be:

from typing import Optional

import numpy as np
import pydantic
from sqlmodel import SQLModel

class DemoModel(SQLModel):
    name: str
    count: Optional[int]

    @pydantic.validator("count", pre=True)
    def validate_nan_to_none(cls, value) -> Optional[int]:
        # map NaN to None; pass other values (including 0) through
        if isinstance(value, float) and np.isnan(value):
            return None
        return value

I don't think changing the type of count to Int64 would help in your case. First, when you create the DataFrame, the inferred dtype for count is float (because of the None). Second, even if you changed the dtype manually to Int64 with astype, it would still fail when pydantic validates the row. It's a rabbit hole the way I see it...
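A quick sketch of that second point (my reconstruction of the failure mode, not output from the thread):

from typing import Optional

import pandas as pd
from sqlmodel import SQLModel

class DemoModel(SQLModel):
    name: str
    count: Optional[int]

df = pd.DataFrame([{"name": "a", "count": 1}, {"name": "b", "count": None}])
print(df["count"].dtype)   # float64: None was stored as NaN at construction

df = df.astype({"count": "Int64"})
print(df["count"].dtype)   # Int64: the missing value is now pd.NA

DemoModel(**df.loc[1])     # still raises: pd.NA is neither an int nor None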

sanzoghenzo commented 1 year ago

I think the PydanticModel as it is is very limited: it just tries to create an instance of each row, so pandera has no control over the validation process.

This is why Int64Dtype doesn't work: the coercion happens at the pydantic level, so you have to write a custom validator to coerce it.

The whole PydanticModel support should be revised to use introspection and translate the pydantic fields into actual pandera Columns. Custom validators can obviously be tricky to convert, but they could be applied after pandera's coercion and type validation.

In the meantime, I'll stick with DataFrameSchemas (as I clarified in my previous post, they work with Int64Dtype just fine).
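A rough sketch of what that translation could look like (hypothetical helper using pydantic-v1-style field introspection; schema_from_pydantic is not a pandera API):

from typing import Optional

import pandas as pd
import pandera as pa
from sqlmodel import SQLModel

class DemoModel(SQLModel):
    name: str
    count: Optional[int]

def schema_from_pydantic(model: type) -> pa.DataFrameSchema:
    """Build a DataFrameSchema from a pydantic v1 model's fields."""
    columns = {}
    for name, field in model.__fields__.items():
        dtype = field.type_
        if dtype is int and field.allow_none:
            dtype = pd.Int64Dtype()  # use the nullable integer dtype
        columns[name] = pa.Column(dtype, nullable=field.allow_none)
    return pa.DataFrameSchema(columns, coerce=True)

schema = schema_from_pydantic(DemoModel)
df = pd.DataFrame([{"name": "a", "count": 1}, {"name": "b", "count": None}])
schema.validate(df)  # count is coerced to Int64; NaN becomes <NA>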

arnoin commented 1 year ago

> I think the PydanticModel as it is is very limited: it just tries to create an instance of each row, so pandera has no control over the validation process.
>
> This is why Int64Dtype doesn't work: the coercion happens at the pydantic level, so you have to write a custom validator to coerce it.
>
> The whole PydanticModel support should be revised to use introspection and translate the pydantic fields into actual pandera Columns. Custom validators can obviously be tricky to convert, but they could be applied after pandera's coercion and type validation.
>
> In the meantime, I'll stick with DataFrameSchemas (as I clarified in my previous post, they work with Int64Dtype just fine).

Yeah, I think #802 would probably address most of these cases better.