unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

MultiIndex unique columns case - pandas error instead of (expected) schema error when pandas version 2.1.0 #1328

Open DrShushen opened 1 year ago

DrShushen commented 1 year ago

Describe the bug

When validating a pandas DataFrame against a schema that requires a MultiIndex with unique index levels (passed via the unique argument), I would expect pandera to raise a pandera.errors.SchemaError: ... not unique exception. However, since pandas version 2.1.0, a pandas error is raised during validation instead, so I receive a pandas ValueError, which is not the expected behavior.

Code Sample, a copy-pastable example

import pandas as pd
import pandera as pa

df = pd.DataFrame(
    {
        "idx0": ["a", "a", "a"],
        "idx1": [1, 2, 1],
        "col1": [1.1, 1.2, 1.3],
    }
)
df.set_index(keys=["idx0", "idx1"], drop=True, inplace=True)
print(df)

schema = pa.DataFrameSchema(
    index=pa.MultiIndex(
        [pa.Index(str, name="idx0"), pa.Index(int, name="idx1")], unique=["idx0", "idx1"]
    )
)
print(schema)

schema.validate(df)  # Error received at this step. 

Expected behavior

I expect to receive the following error:

pandera.errors.SchemaError: columns '('idx0', 'idx1')' not unique:
          idx0  idx1
idx0 idx1           
a    1       a     1
     1       a     1

However, I instead receive the following error (full traceback included):

Traceback (most recent call last):
  File "<LOCAL PATH>/example.py", line 20, in <module>
    schema.validate(df)
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/api/pandas/container.py", line 366, in validate
    return self._validate(
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/api/pandas/container.py", line 395, in _validate
    return self.get_backend(check_obj).validate(
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/backends/pandas/container.py", line 97, in validate
    error_handler = self.run_checks_and_handle_errors(
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/backends/pandas/container.py", line 151, in run_checks_and_handle_errors
    results = check(*args)  # type: ignore [operator]
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/backends/pandas/container.py", line 192, in run_schema_component_checks
    result = schema_component.validate(
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/api/pandas/components.py", line 489, in validate
    return self.get_backend(check_obj).validate(
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/backends/pandas/components.py", line 436, in validate
    validation_result = super().validate(
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/backends/pandas/container.py", line 97, in validate
    error_handler = self.run_checks_and_handle_errors(
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/backends/pandas/container.py", line 151, in run_checks_and_handle_errors
    results = check(*args)  # type: ignore [operator]
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/backends/pandas/container.py", line 734, in check_column_values_are_unique
    failure_cases = reshape_failure_cases(failure_cases)
  File "<PATH TO PYTHON>/python3.9/site-packages/pandera/backends/pandas/error_formatters.py", line 94, in reshape_failure_cases
    failure_cases.rename_axis("column", axis=1)  # type: ignore[call-overload]
  File "<PATH TO PYTHON>/python3.9/site-packages/pandas/core/frame.py", line 9625, in unstack
    result = unstack(self, level, fill_value, sort)
  File "<PATH TO PYTHON>/python3.9/site-packages/pandas/core/reshape/reshape.py", line 506, in unstack
    return obj.T.stack(future_stack=True)
  File "<PATH TO PYTHON>/python3.9/site-packages/pandas/core/frame.py", line 9428, in stack
    result = stack_v3(self, level)
  File "<PATH TO PYTHON>/python3.9/site-packages/pandas/core/reshape/reshape.py", line 887, in stack_v3
    raise ValueError("Columns with duplicate values are not supported in stack")
ValueError: Columns with duplicate values are not supported in stack
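
The last frames of the traceback suggest the error may be reproducible with pandas alone. The following is a minimal sketch, not a confirmed root cause: it assumes that the frame reaching unstack() has duplicate labels in a flat (non-MultiIndex) index. The frame contents and the outcome variable are made up for illustration. The try/except records whichever behavior the installed pandas version shows.

```python
import pandas as pd

# Assumption: a small frame whose flat index contains a duplicate label,
# standing in for the failure cases that pandera tries to reshape.
frame = pd.DataFrame({"idx0": ["a", "a"], "idx1": [1, 1]}, index=[0, 0])

try:
    # For a flat index, pandas >= 2.1 routes this through
    # obj.T.stack(future_stack=True), as the traceback above shows.
    frame.unstack()
    outcome = "no error"
except ValueError as exc:
    outcome = str(exc)

print(pd.__version__, "->", outcome)
```

Running this under both pandas 2.0.3 and 2.1.0 should show whether the behavior change is in pandas itself rather than in pandera.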

Screenshots

Not applicable.

Additional context

When running the example code with pandas 2.0.3 (and pandera 0.16.1), I receive the schema error as expected. The problem therefore appears to be caused by changes in pandas 2.1.0 that pandera does not yet account for; see https://pandas.pydata.org/docs/whatsnew/v2.1.0.html
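
As a possible stopgap that avoids downgrading pandas, index uniqueness can be checked directly in pandas before calling schema.validate. This is only a sketch: duplicated_index_rows is a hypothetical helper, not part of pandera, and it checks all index levels jointly, which matches unique=["idx0", "idx1"] in the example above but not other configurations.

```python
import pandas as pd

df = pd.DataFrame(
    {"idx0": ["a", "a", "a"], "idx1": [1, 2, 1], "col1": [1.1, 1.2, 1.3]}
).set_index(["idx0", "idx1"])


def duplicated_index_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return every row whose index entry occurs more than once."""
    # keep=False marks all occurrences of a duplicated entry, not just the repeats.
    return frame[frame.index.duplicated(keep=False)]


dupes = duplicated_index_rows(df)
if not dupes.empty:
    print("index entries not unique:")
    print(dupes)
```

This does not replace the pandera check; it only surfaces the offending rows without going through the failing reshape step.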

felipeam86 commented 9 months ago

I can confirm that I have the same issue. I downgraded pandas to 2.0.3 and it works as expected.