unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Index of type category fails on validation #840

Open aartaria opened 2 years ago

aartaria commented 2 years ago

Validating an index of dtype "category" fails starting with version 0.8.0.

Minimal reproducible example

import pandas as pd
import pandera as pa

class Schema(pa.SchemaModel):
    categorical_field: pa.typing.Index[pa.Category]

df = (
    pd.DataFrame({"categorical_field": ["a", "b", "c"]})
    .astype({"categorical_field": "category"})
    .set_index("categorical_field")
)
Schema.validate(df)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/env/lib/python3.8/site-packages/pandera/model.py", line 256, in validate
    cls.to_schema().validate(
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 513, in validate
    return self._validate(
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 709, in _validate
    error_handler.collect_error("schema_component_check", err)
  File "/env/lib/python3.8/site-packages/pandera/error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 701, in _validate
    result = schema_component(
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 2043, in __call__
    return self.validate(
  File "/env/lib/python3.8/site-packages/pandera/schema_components.py", line 390, in validate
    super().validate(
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 1976, in validate
    error_handler.collect_error(
  File "/env/lib/python3.8/site-packages/pandera/error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
pandera.errors.SchemaError: expected series 'categorical_field' to have type category, got object

Where it fails

Here https://github.com/pandera-dev/pandera/blob/9a463e1757e2811bbfee4684562541a5f2110cc3/pandera/schema_components.py#L385-L387 the index gets converted to a NumPy array. NumPy has no categorical dtype, so the converted values come back as object and the dtype check fails.

Removing the NumPy conversion lets the validation pass, but I do not know what else that change would or could affect.
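
For anyone who wants to see the dtype loss without pandera involved, here is a minimal pandas-only sketch. It is not pandera's code: the variable names are made up for illustration, and it only assumes that the index is rebuilt as a Series after a NumPy round trip, which is what the traceback's "got object" message suggests.

```python
import pandas as pd

# Same frame as in the report: one categorical column used as the index.
df = (
    pd.DataFrame({"categorical_field": ["a", "b", "c"]})
    .astype({"categorical_field": "category"})
    .set_index("categorical_field")
)

# The index itself carries the categorical dtype.
index_is_categorical = isinstance(df.index.dtype, pd.CategoricalDtype)

# Round-tripping through NumPy discards it, since NumPy has no categorical dtype.
via_numpy = pd.Series(df.index.to_numpy(), name=df.index.name)
numpy_is_categorical = isinstance(via_numpy.dtype, pd.CategoricalDtype)

# Building the Series directly from the index keeps the dtype.
direct = pd.Series(df.index, name=df.index.name)
direct_is_categorical = isinstance(direct.dtype, pd.CategoricalDtype)

print(index_is_categorical, numpy_is_categorical, direct_is_categorical)
```

Only the NumPy round trip loses the categorical dtype, which is consistent with validation passing once the conversion is removed.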

cosmicBboy commented 2 years ago

Thanks for reporting this @aartaria, this is definitely a bug!

I don't remember exactly why that to_numpy call is there. Can you see which unit tests fail if you remove it? I suspect it's there to support the pandas-like frameworks (pyspark.pandas, modin, or dask), but ideally it wouldn't need to be called.

cosmicBboy commented 2 years ago

https://github.com/unionai-oss/pandera/pull/856 apparently fixed this, but I'm going to keep this issue open, since #856 didn't add unit tests for the changes.