Open · phillies opened this issue 1 year ago
Short amendment: switching `int` to `float` makes the coercion pass successfully. Switching the column type from `float` to `pandas.Int64Dtype`, which supports both NA and ints, leads to the same error.
Hi there, I just encountered the same issue using DataFrameSchema.
```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "name": pa.Column(str),
        "count": pa.Column(int, nullable=True),
    },
    coerce=True,
)

df = pd.DataFrame([{"name": "a", "count": 1}, {"name": "b", "count": None}])
schema.validate(df, lazy=True)
```
results in

```
pandera.errors.SchemaErrors: Schema None: A total of 2 schema errors were found.

Error Counts
------------
- SchemaErrorReason.SCHEMA_COMPONENT_CHECK: 2

Schema Error Summary
--------------------
                                             failure_cases  n_failure_cases
schema_context column check
Column         count  coerce_dtype('int64')          [nan]                1
                      dtype('int64')             [float64]                1
```
EDIT: never mind, using `pd.Int64Dtype` as the column type works for `DataFrameSchema`s. Sorry for the noise.
I have looked into this as well, and it seems like pandera is trying to coerce each row using the pydantic model. Similar to, in your case:

```python
from typing import Optional

import pandas as pd
from sqlmodel import SQLModel

class DemoModel(SQLModel):
    name: str
    count: Optional[int]

df = pd.DataFrame([{"name": "a", "count": 1}, {"name": "b", "count": None}])
DemoModel(**df.loc[1])
```

Since the value of `count` in that row is `float('nan')`, it fails: NaN is not a valid `int`.
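To isolate the failure from SQLModel, here is a minimal pydantic-only reproduction of the same problem (the `Demo` model name is made up for illustration):

```python
from typing import Optional

import pydantic

class Demo(pydantic.BaseModel):  # hypothetical stand-in for the SQLModel above
    count: Optional[int] = None

try:
    # This is effectively what DemoModel(**df.loc[1]) does for the NaN row.
    Demo(count=float("nan"))
except pydantic.ValidationError as exc:
    print("validation failed:", exc)
```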
An ugly workaround would be:

```python
from typing import Optional

import numpy as np
import pydantic
from sqlmodel import SQLModel

class DemoModel(SQLModel):
    name: str
    count: Optional[int]

    @pydantic.validator("count", pre=True)
    def validate_nan_to_none(cls, value) -> Optional[int]:
        # Map NaN to None before the int validator runs; pass everything else through.
        if isinstance(value, float) and np.isnan(value):
            return None
        return value
```
I don't think that changing the type of `count` to `Int64` would help in your case. First, when you create the DataFrame, the inferred dtype for `count` is `float64` (because of the `None`). Second, if you change the dtype manually to `Int64` using `astype` or something, it would still fail when pydantic tries to validate it. It's a rabbit hole, the way I see it...
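To make the two points above concrete, a quick pandas-only check (no pandera or pydantic involved):

```python
import pandas as pd

df = pd.DataFrame([{"name": "a", "count": 1}, {"name": "b", "count": None}])

# First point: the None entry forces the column to float64, with NaN as the null.
inferred = str(df["count"].dtype)
print(inferred)  # float64

# Second point: astype("Int64") does convert to the nullable integer dtype,
# but the null becomes pd.NA, which pydantic's int validator still rejects.
df["count"] = df["count"].astype("Int64")
print(df["count"].dtype)   # Int64
print(df.loc[1, "count"])  # <NA>
```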
I think the `PydanticModel` as it is is very limited: it just tries to create an instance of each row, so pandera has no control over the validation process. This is why the `Int64Dtype` doesn't work — the coercion happens at the pydantic level, so you have to write a custom validator to coerce it. The whole `PydanticModel` support should be revised to use introspection and translate the pydantic fields into actual `pandera.Column`s. Obviously custom validators can be tricky to convert, but they could be applied after pandera's coercion and type validation.

In the meantime, I'll stick with `DataFrameSchema`s (as I clarified in my previous post, they work with `Int64Dtype` just fine).
Yeah, I think that #802 would probably address most of these cases better.
Describe the bug

A model with a nullable int field raises a `SchemaError` when coercing to a pandas DataFrame with NaN entries. NaN should be accepted as null.

Code Sample, a copy-pastable example

Result:

Expected behavior

Printing the dataframe:

Then the DataFrame should pass the coercion.

Desktop (please complete the following information):