Columns containing `bool` and `None` values do not validate correctly

unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library

https://www.union.ai/pandera

MIT License


Columns containing `bool` and `None` values do not validate correctly #1807

Open xvr-hlt opened 2 weeks ago

xvr-hlt commented 2 weeks ago

Describe the bug

If I create a pa.DataFrameSchema with a pa.Column(bool, nullable=True), I expect something of the form [None, True] to pass validation, but it does not.

[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandera.
[ ] (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample, a copy-pastable example

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({'x': pa.Column(bool, nullable=True)})
schema(pd.DataFrame({'x': [True, None]}))
# ...
# SchemaError: expected series 'x' to have type bool, got object

Expected behavior

This should pass validation.

Nick-Seinsche commented 2 weeks ago

Hi @xvr-hlt, this is not a bug. Python's None is not a bool, so pandas stores that series with dtype object, which causes the schema validation to fail. Instead, you may want to use the pandas nullable boolean dtype:

pa.DataFrameSchema({"x": pa.Column(pd.BooleanDtype, nullable=True)})(
    pd.DataFrame({"x": [True, pd.NA]}, dtype="boolean")
)

edit: You can also replace pd.NA with None, because the dtype is given explicitly here and pandas converts None to pd.NA for you.
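To make the dtype difference concrete, here is a minimal sketch using pandas only. The variable names `inferred` and `nullable` are illustrative, not from the thread, and the behavior shown is what I would expect from pandas' dtype inference for a mix of bool and None.

```python
import pandas as pd

# No dtype given: pandas cannot hold None in a numpy bool array,
# so it falls back to the generic object dtype.
inferred = pd.Series([True, None])
print(inferred.dtype)

# Explicit nullable boolean dtype: None is stored as pd.NA.
nullable = pd.Series([True, None], dtype="boolean")
print(nullable.dtype)
print(nullable[1] is pd.NA)
```

The first series is what the `bool` column schema sees in the original report, which is why the dtype check fails before nullability is ever considered.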

xvr-hlt commented 2 weeks ago

I understand that None is not a bool; what confused me is that None was invalid for a field with nullable=True.

Additionally, this behaviour is inconsistent: with a str field, None is valid input where nullable=True:

import pandas as pd
import pandera as pa

pa.DataFrameSchema({'x': pa.Column(str, nullable=True)})(pd.DataFrame({'x': ["abc", None]}))

This passes without error.

Nick-Seinsche commented 1 week ago

Regarding your second point: yes, this is again due to pandas. The series in your DataFrame has dtype object. Both "abc" and None are objects, and since nullable=True, None is allowed, so the validation passes.

Regarding your first point: the validation doesn't fail because nullable=True "doesn't work". It fails because you specify the column to be dtype bool, but the DataFrame you pass to the schema has a column of dtype object.

cosmicBboy commented 22 hours ago

@xvr-hlt dealing with nulls with the default numpy types is a pain; I'd recommend using the pandas-native nullable dtype:

pandera's design choice is to delegate behavior to the underlying dataframe library. In this case it inherits the datatype behavior of pandas: a column holding both boolean and None values is interpreted by pandas as having an object dtype.