unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Incorrectly raised SchemaError when validating multiindex DataFrame #1131

Closed ErikLundin98 closed 1 year ago

ErikLundin98 commented 1 year ago

Given this minimal example

import pandas as pd
import pandera as pa

class MultiIndexTestSchema(pa.SchemaModel):
    boolean_index_one: pa.typing.Index[bool] = pa.Field(coerce=True)
    boolean_index_two: pa.typing.Index[bool] = pa.Field(coerce=True)
    value: pa.typing.Series[int] = pa.Field()

df = pd.DataFrame({
    "boolean_index_one": [True, False, True, True, False], 
    "boolean_index_two": [True, True, True, True, True],
    "value": [1, 2, 3, 4, 5],
})
df = df.set_index(keys=["boolean_index_one", "boolean_index_two"])
MultiIndexTestSchema.validate(df)

The expected behaviour is that validation passes, since both index levels contain only boolean values.

Instead, I get the following error:

raise schema_error from original_exc
pandera.errors.SchemaError: expected series 'boolean_index_one' to have type bool, got object

This issue seems to occur only with MultiIndex DataFrames whose index levels are boolean. Changing the type from bool to int resolves the issue.

Does anyone know a workaround for this, or am I misunderstanding something about how to define the schema?

Thanks in advance!

I am using pandera version 0.14.4

cosmicBboy commented 1 year ago

Hi @ErikLundin98, what versions of pandas and Python are you using?

ErikLundin98 commented 1 year ago

@cosmicBboy, I'm using Python 3.10.10 and pandas 1.3.5

cosmicBboy commented 1 year ago

Unfortunately, pandas 1.3.5 has a number of issues with index data types. See this StringDtype xfail test as an example: https://github.com/unionai-oss/pandera/blob/fe83c19a1aebb127f22e8bee849be70a1a96c33a/tests/core/test_schema_components.py#L838-L856

This is purely a pandas issue:

In [1]: import pandas as pd

In [2]: pd.Index([True, False])
Out[2]: Index([True, False], dtype='object')

In [3]: pd.Index([True, False], dtype=bool)
Out[3]: Index([True, False], dtype='object')  # it's still an "object"!

Any chance you can update your pandas version?
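
To check whether a given environment is affected, the session above can be turned into a small script. This is only a sketch: the helper names `index_keeps_bool_dtype` and `index_level_dtypes` are made up for illustration and are not part of pandas or pandera. It uses pandas alone and reports what dtype the installed version actually stores for a boolean Index and for each level of the MultiIndex from the original example.

```python
import pandas as pd


def index_keeps_bool_dtype() -> bool:
    """Return True if this pandas version stores a boolean Index as dtype bool.

    Hypothetical helper: on affected versions (such as 1.3.5 in this thread)
    the Index falls back to dtype object and this returns False.
    """
    idx = pd.Index([True, False], dtype=bool)
    return bool(idx.dtype == bool)


def index_level_dtypes(df: pd.DataFrame) -> dict:
    """Map each index level name to the dtype pandas stores for that level."""
    return {
        name: df.index.get_level_values(name).dtype
        for name in df.index.names
    }


# Same data as the minimal example in the original report.
df = pd.DataFrame(
    {
        "boolean_index_one": [True, False, True, True, False],
        "boolean_index_two": [True, True, True, True, True],
        "value": [1, 2, 3, 4, 5],
    }
).set_index(["boolean_index_one", "boolean_index_two"])

print("pandas version:", pd.__version__)
print("bool Index keeps dtype bool:", index_keeps_bool_dtype())
print("index level dtypes:", index_level_dtypes(df))
```

If the index levels are reported as object rather than bool, the SchemaError in the report is what one would expect to see, because the schema's dtype check is run against the dtype pandas stores for the index.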

ErikLundin98 commented 1 year ago

Thank you for clarifying that it's a pandas issue! I will see if I can update pandas.