Open xvr-hlt opened 2 weeks ago
Hi @xvr-hlt, this is not a bug. Python's None is not a bool, so pandas converts that series to object dtype, which causes the schema validation to fail. Instead, you may want to use the pandas nullable boolean dtype:
import pandas as pd
import pandera as pa

pa.DataFrameSchema({"x": pa.Column(pd.BooleanDtype, nullable=True)})(
    pd.DataFrame({"x": [True, pd.NA]}, dtype="boolean")
)
edit: You can also replace pd.NA with None, because the dtype is given explicitly here and pandas converts None to pd.NA for you.
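The conversion mentioned in the edit can be checked with pandas alone. This is a minimal sketch; the variable name `explicit` is only illustrative and does not come from the thread.

```python
import pandas as pd

# With an explicit nullable boolean dtype, pandas stores None as pd.NA
# instead of falling back to an object column.
explicit = pd.Series([True, None], dtype="boolean")

print(explicit.dtype)        # boolean
print(explicit[1] is pd.NA)  # True
```

Because the column keeps the "boolean" dtype, a schema that expects pd.BooleanDtype sees the dtype it asked for.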
I understand that None is not a bool; what was confusing to me is that None was invalid for a field with nullable=True.
Additionally, this behaviour is inconsistent: with a str field, None is valid input where nullable=True:
import pandas as pd
import pandera as pa
pa.DataFrameSchema({'x': pa.Column(str, nullable=True)})(pd.DataFrame({'x': ["abc", None]}))
This passes without error.
Regarding your second point: Yes, this again is due to pandas. The series of your DataFrame is dtype object. Both "abc" and None are objects and since nullable=True, None is allowed so the test passes.
Regarding your first point: The test doesn't fail because nullable=True "doesn't work". It fails because you specify the column to be dtype bool, but the dataframe you pass into the schema check has a column of dtype object, so the validation fails.
@xvr-hlt dealing with nulls using the default numpy types is a pain; I'd recommend using the pandas-native nullable dtype.
pandera's design choice is to delegate behavior to the underlying dataframe library, so in this case it inherits the datatype behavior of pandas: a column that holds both a boolean and a None value is interpreted by pandas as having object dtype.
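The dtype inference described above can be reproduced without pandera. This is a minimal sketch that assumes only pandas is installed; the variable names are illustrative.

```python
import pandas as pd

# Only booleans: pandas infers the numpy bool dtype.
plain = pd.Series([True, False])

# A boolean mixed with None: pandas falls back to the generic object dtype,
# which is why a check against dtype bool fails before nullability matters.
mixed = pd.Series([True, None])

# Requesting the nullable extension dtype keeps the column boolean.
nullable = pd.Series([True, None], dtype="boolean")

print(plain.dtype)     # bool
print(mixed.dtype)     # object
print(nullable.dtype)  # boolean
```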
Describe the bug
If I create a pa.DataFrameSchema with a pa.Column(bool, nullable=True), I expect something of the form [None, True] to pass validation, but it does not.
Code Sample, a copy-pastable example
Expected behavior
This should pass validation.