unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Does pandera support validating enum.Enum or subclasses of it? #911

Open cosmicBboy opened 2 years ago

cosmicBboy commented 2 years ago

Discussed in https://github.com/unionai-oss/pandera/discussions/907

Originally posted by **davidandreoletti** August 8, 2022

Assuming a pydantic-like class declaration with:

```
class SizeEnum(enum.Enum):
    BIG = "big"
    SMALL = "small"
```

```
class SummaryDFSchema(pandera...):
    size : pandera.Series[SizeEnum]
    name : ...
```

Currently pandera fails (by raising an exception) because it seems pandas does not recognize the Enum as a registered custom dtype. What methods or workarounds could be used to let pandera check that the column contains SizeEnum instances (rather than one of its string values such as "big")?
cosmicBboy commented 2 years ago

The DataTypes extension api would allow support for Enums, see: https://pandera.readthedocs.io/en/stable/dtypes.html

One of the main design choices here would be: how to represent enums? There could be several options to represent these in the underlying dataframe:

From experience, there may be some issues using the actual Enum as an object type when it comes to certain operations, tho I haven't tested it out in a while.

Open to ideas and perhaps a PR @davidandreoletti?

the-matt-morris commented 2 years ago

This is a cool idea. It's something I've added in for a specific use case of mine, though I admit all my Enum subclasses are strings so I'm not hitting on any obscure cases. Gotta be a better way to do this than what I hacked together, but I added a step to SchemaModel.__init_subclass__ to update fields with Enum annotations to have categorical types with the categories defined by the Enum values.

One argument for using categorical type is that it can handle data types other than strings:

This one is the first example in the enum docs:

from enum import Enum
import pandas as pd
from pandera import SchemaModel
from pandera.typing import Series

class Color(Enum):
    RED = 1
    GREEN = 2
    BLUE = 3

class MySchema(SchemaModel):
    color: Series[Color]

df = pd.DataFrame({"color": [1, 2, 3]})
MySchema.validate(df)
  color
0     1
1     2
2     3
df = pd.DataFrame({"color": [1, 2, 3, 4]})
MySchema.validate(df)
...
pandera.errors.SchemaError: Error while coercing 'color' to type category: Could not coerce <class 'pandas.core.series.Series'> data_container into type category:
   index  failure_case
0      3             4
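
The patch to SchemaModel.__init_subclass__ is not shown in this thread. As a rough pandas-only sketch of the idea behind it (build a categorical dtype from the Enum values, then flag anything that falls outside those categories), assuming hypothetical helper names `enum_to_categorical_dtype` and `coerce_to_enum_categories` that are not part of pandera:

```python
from enum import Enum

import pandas as pd


class Color(Enum):
    RED = 1
    GREEN = 2
    BLUE = 3


def enum_to_categorical_dtype(enum_cls):
    """Hypothetical helper: categories are the Enum *values*, in definition order."""
    return pd.CategoricalDtype(categories=[member.value for member in enum_cls])


def coerce_to_enum_categories(series, enum_cls):
    """Hypothetical helper: coerce to the categorical dtype, rejecting unknown values."""
    coerced = series.astype(enum_to_categorical_dtype(enum_cls))
    # astype() silently turns values outside the categories into NaN,
    # so compare against the original data to find the failure cases.
    failed = coerced.isna() & series.notna()
    if failed.any():
        raise ValueError(
            f"values not in {enum_cls.__name__}: {series[failed].tolist()}"
        )
    return coerced


valid = coerce_to_enum_categories(pd.Series([1, 2, 3]), Color)
```

Calling the same helper on `pd.Series([1, 2, 3, 4])` raises, since 4 is not a Color value. This only mirrors the behaviour shown above; how pandera itself reports the coercion failure may differ.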

That said, not sure what to do about Enum subclasses that have values that are not scalars:

class Planet(Enum):
    MERCURY = (3.303e+23, 2.4397e6)
    VENUS   = (4.869e+24, 6.0518e6)
    EARTH   = (5.976e+24, 6.37814e6)
    MARS    = (6.421e+23, 3.3972e6)
    JUPITER = (1.9e+27,   7.1492e7)
    SATURN  = (5.688e+26, 6.0268e7)
    URANUS  = (8.686e+25, 2.5559e7)
    NEPTUNE = (1.024e+26, 2.4746e7)
    def __init__(self, mass, radius):
        self.mass = mass       # in kilograms
        self.radius = radius   # in meters

In this case, the expected values in the series would be (float, float) tuples that correspond to the value of a Planet member. Maybe that is ok.
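
To make the tuple case concrete, here is a small sketch (not pandera code) that checks whether each element of an object-dtype series is the value of some Planet member. The shortened Planet class and the names `allowed` and `is_planet_value` are only for illustration:

```python
from enum import Enum

import pandas as pd


class Planet(Enum):
    MERCURY = (3.303e+23, 2.4397e6)
    EARTH = (5.976e+24, 6.37814e6)

    def __init__(self, mass, radius):
        self.mass = mass       # in kilograms
        self.radius = radius   # in meters


# Tuples are hashable, so the member values can go into a plain set.
allowed = {member.value for member in Planet}

series = pd.Series([Planet.EARTH.value, (1.0, 2.0)])

# Element-wise membership test on the raw (float, float) tuples.
is_planet_value = series.map(lambda v: v in allowed)
```

Here the first element matches Planet.EARTH and the second matches nothing. Values that are not hashable (lists, dicts) would need a different membership test.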

It seems an important question regards whether the expected series values are instances of the Enum (i.e. Color.RED) or the Enum values (1). I would think the values, but thoughts on that?
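
For reference, a minimal sketch of the two representations being weighed here, using plain pandas and the Color enum from above. The variable names are illustrative only:

```python
from enum import Enum

import pandas as pd


class Color(Enum):
    RED = 1
    GREEN = 2
    BLUE = 3


# Option 1: the series holds the Enum values (a regular integer column).
values = pd.Series([1, 2, 3])

# Option 2: the series holds Enum members (an object-dtype column).
members = values.map(lambda v: Color(v))

# Converting back from members to values.
round_trip = members.map(lambda m: m.value)
```

The value-based column keeps a native dtype, while the member-based column falls back to object dtype, which is likely where the operational issues mentioned earlier in the thread come from.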

cosmicBboy commented 2 years ago

> It seems an important question regards whether the expected series values are instances of the Enum (i.e. Color.RED) or the Enum values (1). I would think the values, but thoughts on that?

@davidandreoletti what do you think?

dantheand commented 2 years ago

For string-type enums, I've found a fairly elegant solution is to pass them directly to pandera.Field as the isin= argument. I don't know if this would work for non-hashable python objects.

import enum
import pandera
import pandas as pd
from pandera.typing import Series

class TestEnum(str, enum.Enum):
    CLASS_1 = "class 1"
    CLASS_2 = "class 2"

class Schema(pandera.SchemaModel):
    class_col: Series[pd.StringDtype] = pandera.Field(isin=TestEnum)
davidandreoletti commented 2 years ago

@cosmicBboy I think @dantheand's solution is elegant and supports most data types.

Perhaps class_col: Series[TestEnum] should be syntactic sugar for class_col: Series[pd.StringDtype] = pandera.Field(isin=TestEnum)?

cosmicBboy commented 2 years ago

> Perhaps class_col: Series[TestEnum] should be syntactic sugar for class_col: Series[pd.StringDtype] = pandera.Field(isin=TestEnum).

So data types and checks are intentionally separated concerns in pandera... the reason being that data types have the additional capability of coercing (i.e. parsing) raw data into the desired types, which correspond to some machine-level representation (e.g. int64, str, etc) whereas checks are simply functions that validate a property of the potentially-coerced data.
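
The separation described above can be illustrated without pandera at all. In this sketch, `coerce_to_int64` and `check_isin` are hypothetical stand-ins for a data type's coercion step and a check, not pandera functions:

```python
import pandas as pd


def coerce_to_int64(series: pd.Series) -> pd.Series:
    """Data type concern: parse raw data into a machine-level representation."""
    return series.astype("int64")


def check_isin(series: pd.Series, allowed) -> pd.Series:
    """Check concern: validate a property of the already-coerced data."""
    return series.isin(list(allowed))


raw = pd.Series(["1", "2", "4"])       # raw data arrives as strings
coerced = coerce_to_int64(raw)         # step 1: coercion changes the representation
passed = check_isin(coerced, {1, 2, 3})  # step 2: the check only inspects the data
```

The coercion step returns new data, while the check returns a boolean mask and leaves the data untouched, which is the distinction being drawn.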

Conflating the two by converting class_col: Series[TestEnum] to class_col: Series[pd.StringDtype] = pandera.Field(isin=TestEnum) under the hood would introduce additional complexity to the library with not that many keystrokes saved.

The takeaway here is that class_col: Series[pd.StringDtype] = pandera.Field(isin=TestEnum) is a good enough solution for supporting enums in pandera, but for a deeper integration with the type system, defining a custom data type would be necessary if you want to take advantage of encoding Enums as pandas Categorical types, for example.

Why? Because Enums are not limited to string dtypes and could potentially be ordered (pandas categoricals support this, but Python enums don't out of the box).
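
As a sketch of what the ordered encoding could look like in plain pandas, assuming the Enum's definition order is taken as the ordering (the `Size` enum and variable names here are illustrative, and this is not an existing pandera feature):

```python
from enum import Enum

import pandas as pd


class Size(Enum):
    SMALL = "small"
    BIG = "big"


# Iterating an Enum yields members in definition order, so SMALL < BIG here.
size_dtype = pd.CategoricalDtype(
    categories=[member.value for member in Size], ordered=True
)

sizes = pd.Series(["big", "small", "big"]).astype(size_dtype)

# Ordered categoricals support comparisons and order-aware sorting.
smaller_than_big = sizes < "big"
sorted_sizes = sizes.sort_values()
```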

If anyone's down to make a PR for this, I'd welcome it! @davidandreoletti @the-matt-morris @dantheand

dantheand commented 2 years ago

@davidandreoletti @cosmicBboy I actually just found out that using class_col: Series[pd.StringDtype] = pandera.Field(isin=TestEnum) results in a yaml.representer.RepresenterError when trying to do SchemaModel.to_yaml(), so the best solution I've found is to convert to a list of strings via a list comprehension:

class_col: Series[pd.StringDtype] = pandera.Field(isin=[entry.value for entry in TestEnum])
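
A standard-library-only sketch of why the list comprehension differs from passing the Enum directly: iterating the Enum yields members, whereas `.value` yields built-in strings, which a serializer is more likely to know how to represent. The names `members` and `plain_values` are illustrative:

```python
import enum


class TestEnum(str, enum.Enum):
    CLASS_1 = "class 1"
    CLASS_2 = "class 2"


# Iterating the Enum class gives TestEnum members (a str subclass, not plain str).
members = list(TestEnum)

# Taking .value gives the underlying built-in str objects.
plain_values = [entry.value for entry in TestEnum]
```

Members still compare equal to their string values, which is presumably why isin=TestEnum validates correctly even though serialization fails.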

@cosmicBboy When you say "custom data type", do you mean Logical Data type described in the documentation you linked?

cosmicBboy commented 2 years ago

> I actually just found out that using class_col: Series[pd.StringDtype] = pandera.Field(isin=TestEnum) results in a yaml.representer.RepresenterError when trying to do SchemaModel.to_yaml()

Ah! yeah the serialize checks logic would need to be updated to handle enums... feel free to open up a new issue if you want first class support here.

> When you say "custom data type", do you mean Logical Data type described in the documentation you linked?

Yep! Although in this case a PR would make it a built-in datatype in pandera.engines.pandas_engine

racash007 commented 1 year ago

Hi guys, is there an update on this issue?