Describe the bug
When validating a pandas DataFrame against a schema that requires a MultiIndex with unique index levels (columns), I would expect pandera to raise a pandera.errors.SchemaError: ... not unique exception. However, since pandas version 2.1.0, a pandas error occurs during validation, so I receive a pandas ValueError instead, which is not the expected behavior.
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandera.
Code Sample, a copy-pastable example
import pandas as pd
import pandera as pa

df = pd.DataFrame(
    {
        "idx0": ["a", "a", "a"],
        "idx1": [1, 2, 1],
        "col1": [1.1, 1.2, 1.3],
    }
)
df.set_index(keys=["idx0", "idx1"], drop=True, inplace=True)
print(df)

schema = pa.DataFrameSchema(
    index=pa.MultiIndex(
        [pa.Index(str, name="idx0"), pa.Index(int, name="idx1")],
        unique=["idx0", "idx1"],
    )
)
print(schema)

schema.validate(df)  # Error received at this step.
Expected behavior
I expect to receive the following error:
pandera.errors.SchemaError: columns '('idx0', 'idx1')' not unique:
           idx0  idx1
idx0 idx1
a    1        a     1
     1        a     1
However, I instead receive the following error (showing also traceback):
Traceback (most recent call last):
  File "<LOCAL PATH>/example.py", line 20, in <module>
    schema.validate(df)
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/api/pandas/container.py", line 366, in validate
    return self._validate(
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/api/pandas/container.py", line 395, in _validate
    return self.get_backend(check_obj).validate(
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/backends/pandas/container.py", line 97, in validate
    error_handler = self.run_checks_and_handle_errors(
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/backends/pandas/container.py", line 151, in run_checks_and_handle_errors
    results = check(*args)  # type: ignore [operator]
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/backends/pandas/container.py", line 192, in run_schema_component_checks
    result = schema_component.validate(
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/api/pandas/components.py", line 489, in validate
    return self.get_backend(check_obj).validate(
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/backends/pandas/components.py", line 436, in validate
    validation_result = super().validate(
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/backends/pandas/container.py", line 97, in validate
    error_handler = self.run_checks_and_handle_errors(
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/backends/pandas/container.py", line 151, in run_checks_and_handle_errors
    results = check(*args)  # type: ignore [operator]
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/backends/pandas/container.py", line 734, in check_column_values_are_unique
    failure_cases = reshape_failure_cases(failure_cases)
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/backends/pandas/error_formatters.py", line 94, in reshape_failure_cases
    failure_cases.rename_axis("column", axis=1)  # type: ignore[call-overload]
  File "<PATH TO PYTHON>/python3.9/site-packages/pandas/core/frame.py", line 9625, in unstack
    result = unstack(self, level, fill_value, sort)
  File "<PATH TO PYTHON>/python3.9/site-packages/pandas/core/reshape/reshape.py", line 506, in unstack
    return obj.T.stack(future_stack=True)
  File "<PATH TO PYTHON>/python3.9/site-packages/pandas/core/frame.py", line 9428, in stack
    result = stack_v3(self, level)
  File "<PATH TO PYTHON>/python3.9/site-packages/pandas/core/reshape/reshape.py", line 887, in stack_v3
    raise ValueError("Columns with duplicate values are not supported in stack")
ValueError: Columns with duplicate values are not supported in stack
Desktop (please complete the following information):
OS: Linux (Ubuntu 22.04)
Browser: N/A
Version: Python 3.9, pandas 2.1.0, pandera 0.16.1
Screenshots
Not applicable.
Additional context
When running the example code with pandas 2.0.3 (and pandera 0.16.1), I receive the SchemaError as expected. Hence the problem appears to be caused by changes in pandas 2.1.0 that pandera does not yet account for; see https://pandas.pydata.org/docs/whatsnew/v2.1.0.html
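As a possible stopgap, the duplicate index entries can be found with plain pandas before calling schema.validate, so the check does not go through pandera's failure-case formatting at all. The sketch below is only an illustration under that assumption: find_duplicate_index_rows is a hypothetical helper name, not part of pandera or pandas, and it does not reproduce pandera's SchemaError message.

```python
import pandas as pd


def find_duplicate_index_rows(df: pd.DataFrame, levels: list[str]) -> pd.DataFrame:
    """Return the rows whose values in the given index levels occur more than once.

    Hypothetical helper for illustration; not a pandera or pandas API.
    """
    # Turn the requested index levels into ordinary columns so that
    # DataFrame.duplicated can compare them row by row.
    keys = df.index.to_frame(index=False)[levels]
    # keep=False marks every member of a duplicated group, not only the repeats.
    mask = keys.duplicated(keep=False).to_numpy()
    return df[mask]


# Same data as in the code sample above.
df = pd.DataFrame(
    {
        "idx0": ["a", "a", "a"],
        "idx1": [1, 2, 1],
        "col1": [1.1, 1.2, 1.3],
    }
).set_index(["idx0", "idx1"])

duplicates = find_duplicate_index_rows(df, ["idx0", "idx1"])
if not duplicates.empty:
    print(f"index levels ('idx0', 'idx1') not unique:\n{duplicates}")
```

For the example data this selects the two rows indexed by ("a", 1), which are the same rows the expected SchemaError reports as failure cases.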