unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Decouple pandera and pandas type systems #369

Closed. cosmicBboy closed this issue 3 years ago

cosmicBboy commented 3 years ago

Is your feature request related to a problem? Please describe.

Currently, pandera's type system is strongly coupled to the pandas type system. This works well in pandera's current state since it only supports pandas dataframe validation. However, in order to obtain broader coverage of dataframe-like data structures in the python ecosystem, I think it makes sense to gradually abstract pandera's type system so that it isn't so strongly coupled to pandas' type system.

The PandasDtype enum class needs to be made more flexible so that it supports types with dynamic definitions like CategoricalDtype and PeriodDtype.
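
As a quick illustration of why a static enum is limiting (this snippet is not from the original issue), these pandas dtypes are parametrized at construction time, so a single enum member per dtype kind cannot represent them:

import pandas as pd

# each call below produces a distinct dtype; a fixed enum member cannot capture
# the runtime parameters (categories, ordered, freq, ...)
print(pd.CategoricalDtype())                          # category, no fixed categories
print(pd.CategoricalDtype(["a", "b"], ordered=True))  # category with fixed, ordered categories
print(pd.PeriodDtype(freq="M"))                       # period[M]
print(pd.PeriodDtype(freq="D"))                       # period[D]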

Describe the solution you'd like

TBD

Describe alternatives you've considered

TBD

Additional context

TBD

jeffzi commented 3 years ago

I'd like to suggest another use case for a refactor of pandera dtypes.

I use pandera to validate pandas DataFrames that are ultimately written as parquet files via pyarrow. Parquet supports date and decimal types which are not natively supported by pandas but can be stored in object columns. I wanted to let pandera coerce to those types during validation. So far, my solution has been to subclass pandera.Column and override coerce_dtype.

Example for date:

from typing import Optional
import pandas as pd
import pandera as pa

class DateColumn(pa.Column):
    """Column containing date values encoded as date types (not datetime)."""

    def __init__( 
        self,
        checks: pa.schemas.CheckList = None,
        nullable: bool = False,
        allow_duplicates: bool = True,
        coerce: bool = False,
        required: bool = True,
        name: Optional[str] = None,
        regex: bool = False,
    ) -> None:
        super().__init__(
            pa.Object, # <===========
            checks=checks,
            nullable=nullable,
            allow_duplicates=allow_duplicates,
            coerce=coerce,
            required=required,
            name=name,
            regex=regex,
        )

    def coerce_dtype(self, series: pd.Series) -> pd.Series:
        """Coerce a pandas.Series to date types."""
        try:
            dttms = pd.to_datetime(series, infer_datetime_format=True, utc=True)
        except TypeError as err:
            msg = f"Error while coercing '{self.name} to type'date'"
            raise TypeError(msg) from err
        return dttms.dt.date

schema = pa.DataFrameSchema(columns={"dt": DateColumn(coerce=True)})
df = pd.DataFrame({'dt':pd.date_range("2021-01-01", periods=1, freq="H")})
print(df)
#>           dt
#> 0 2021-01-01
print(df["dt"].dtype)
#> datetime64[ns]

df = schema.validate(df)
print(df)
#>            dt
#> 0  2021-01-01
print(df["dt"].dtype)
#> object

Created on 2021-01-07 by the reprexpy package

The issue with the above is that custom columns are not compatible with SchemaModel.

I suggest moving coerce_dtype() to a Dtype class that could be subclassed to add arguments to __init__ and/or modify coercion (a small sketch of the idea follows at the end of this comment). That would help decouple the coercion logic from DataFrameSchema, e.g.: https://github.com/pandera-dev/pandera/blob/bfdb118504f78f7dcf6dcba96f6200194b4f5fbf/pandera/schemas.py#L285

#376 already lists a couple of solutions for passing arguments to dtypes in SchemaModel (e.g. Decimal needs precision and scale).
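
To make the suggestion concrete, here is a minimal sketch of what the date example could look like if coercion lived on a subclassable dtype class. The Dtype and Date names are hypothetical, not an existing pandera API:

import pandas as pd

class Dtype:
    """Hypothetical base class that owns coercion logic."""

    def coerce(self, series: pd.Series) -> pd.Series:
        raise NotImplementedError

class Date(Dtype):
    """Hypothetical date dtype: stored in an object column, coerced via pd.to_datetime."""

    def coerce(self, series: pd.Series) -> pd.Series:
        try:
            dttms = pd.to_datetime(series, utc=True)
        except TypeError as err:
            raise TypeError(f"Error while coercing '{series.name}' to type 'date'") from err
        return dttms.dt.date

# a plain pa.Column (or SchemaModel field) could then reference Date() and delegate
# coercion to it, instead of requiring a custom Column subclass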

cosmicBboy commented 3 years ago

Cool, thanks for describing this use case!

I suggest moving coerce_dtype() to a Dtype class that could be subclassed to add arguments to __init__ and/or modify coercion.

👍 to sketch out some ideas for the dtype class:

import pandas as pd

from enum import Enum

# abstract class spec, should support types for other dataframe-like data structures
# e.g. spark dataframes, dask, ray, vaex, xarray, etc.
class DataType:

    def __call__(self, obj):  # obj should be an arbitrary object
        """Coerces object to the dtype."""
        raise NotImplementedError

    def __eq__(self, other):
        # some default equality implementation
        pass

    def __hash__(self):
        # some default hash implementation
        pass

class PandasDataType(DataType):

    def __init__(self, str_alias):
        self.str_alias = str_alias

    def __call__(self, obj):
        # obj should be a pandas DataFrame, Series, or Index
        return obj.astype(self.str_alias)

# re-implementation of dtypes.PandasDtype, which is currently an Enum class,
# preserving the PandasDtype class for backwards compatibility
class PandasDtype:

    # See the pandas dtypes docs for more information:
    # https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes

    Bool = PandasDataType("bool")
    DateTime = PandasDataType("datetime64[ns]")
    Timedelta = PandasDataType("timedelta64[ns]")
    # etc. for the rest of the data-types that don't need additional arguments

    # use static methods for datatypes with additional arguments
    @staticmethod
    def DatetimeTZ(tz="utc"):
        return PandasDataType(f"datetime64[ns, {tz}]")

    @staticmethod
    def Period(freq):
        pass

    @staticmethod
    def Interval(numpy_dtype=None, tz=None, freq=None):
        pass

    @staticmethod
    def Categorical(categories=None, ordered=False):
        pass

jeffzi commented 3 years ago

preserving the PandasDtype class for backwards compatibility

The point would be to minimize the impact on the implementation, wouldn't it?

The Enum is not mentioned in the documentation (besides the API section). All examples exclusively use the aliases such as pa.Int, etc. I don't see a situation where the end-user would need to deal with PandasDtype other than through the aliases that are exposed for convenience (which could also be unified across dataframe types if pandera is expanded to spark, etc.)

This is what I have in mind:

class PandasDataType(DataType): # can be renamed to PandasDtype if it facilitates implementation
    ...

Bool = PandasDataType("bool")
... # other straightforward dtypes

class DatetimeTZ(PandasDataType):

    def __init__(self, tz="utc"):
        super().__init__(f"datetime64[ns, {tz}]")
        self.tz = tz  # in case we need it for other methods

class Datetime(PandasDataType):

    # args forwarded to pd.to_datetime, used for better coercion (if coerce=True)
    def __init__(self, dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit=None, ...):
        ...

    def __call__(self, obj):
        return pd.to_datetime(obj, dayfirst=self.dayfirst, ...)

cosmicBboy commented 3 years ago

I'm all for simplification, I think I didn't take into account the fact that most users probably never use PandasDtype directly :)

Another place PandasDtype enum shows up in the documentation is here: https://pandera.readthedocs.io/en/stable/schema_inference.html#write-to-a-python-script, so part of the solution in the issue here would be to change e.g. PandasDtype.Int to pa.Int in the io.to_script function. I don't think changing this will have a big impact to the end user, since writing a schema out as a python script is mainly meant as an initial code template for when a user infers a schema.
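
For context, this is roughly the inference workflow being referred to (a sketch assuming the pa.infer_schema / DataFrameSchema.to_script API); the dtype references emitted in the generated script are what would switch from PandasDtype.Int to pa.Int:

import pandas as pd
import pandera as pa

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
schema = pa.infer_schema(df)
print(schema.to_script())  # emits the inferred schema as a Python code template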

Getting away from the PandasDtype enum would be a good thing, and adding documentation on how to implement custom DataTypes and coercion logic would be a move in the right direction.

cosmicBboy commented 3 years ago

hey @jeffzi, just wanted to bring your attention to the visions package: https://dylan-profiler.github.io/visions/visions/getting_started/usage/types.html

It's written by the same people who wrote pandas-profiling, and it might be worth a look to see if it fits our needs here. If not it might nevertheless be a good source of inspiration/ideas.

jeffzi commented 3 years ago

Thanks @cosmicBboy, I did not know about visions.

Before presenting my evaluation of visions, let me restate the goals of the dtype refactor:

  1. Decouple pandera and pandas dtypes: opening up to other dataframe-like data structures in the python ecosystem, such as Apache Spark, Apache Arrow and xarray.
  2. Built-in support for dynamic dtypes: e.g. categorical dtype implementations often have ordered and categories arguments.
  3. Class-based dtypes should integrate nicely with SchemaModel api.
  4. Allow the end-user to customize dtype coercion, for example by passing a date format (a small sketch of this follows the list).
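
To make goal 4 concrete, a small sketch (hypothetical names, not an existing pandera API) of a dtype that carries its own coercion parameters, here a date format:

from dataclasses import dataclass

import pandas as pd

@dataclass(frozen=True)
class Date:
    """Hypothetical user-defined dtype that owns its coercion parameters."""

    format: str = "%Y-%m-%d"

    def coerce(self, series: pd.Series) -> pd.Series:
        return pd.to_datetime(series, format=self.format).dt.date

# e.g. Date(format="%d/%m/%Y") could then be used anywhere a dtype is accepted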

Now, regarding visions:

+:

-:

I think visions' network of dtypes is promising, especially for dtype inference. Unfortunately, I don't see major advantages over a simple class hierarchy.

I would keep the following ideas:

cosmicBboy commented 3 years ago

Just adding my thoughts here @jeffzi for the record

Unfortunately, I don't see major advantages over a simple class hierarchy.

Agreed! Let's go with our own class hierarchy, something like what we've discussed in https://github.com/pandera-dev/pandera/issues/369#issuecomment-757514810

Specialized dtypes for IP addresses, emails, URLs, etc. They will play nicely with SchemaModel. Pydantic has a similar concept. It (probably) shows that users find them useful.

Would love this, we can tackle these once the major refactor for existing dtypes is done. (would also love a Text and Image data type, basically pointers to files local or remote files for ML modeling use cases)

Design a mechanism to restrict allowed dtypes. Maybe an extra argument in DataFrameSchema.__init__(allowed_dtypes)

+1 to this idea, we can also cross that bridge when we get there. Another thought I had was having it as a class attribute:

class DataFrameSchema():
    allowed_dtypes = [...]
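
A rough sketch (hypothetical, not an existing pandera feature) of how such a class attribute could be enforced at validation time:

import pandera as pa

class StrictSchema(pa.DataFrameSchema):
    # hypothetical whitelist of dtype string aliases
    allowed_dtypes = ("int64", "object")

    def validate(self, check_obj, *args, **kwargs):
        # reject any column whose declared dtype is outside the whitelist
        for name, column in self.columns.items():
            if str(column.dtype) not in self.allowed_dtypes:
                raise TypeError(f"column '{name}' uses dtype {column.dtype}, not in allowed_dtypes")
        return super().validate(check_obj, *args, **kwargs)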

Let me know if you need any help with discussing approach/architecture!

jeffzi commented 3 years ago

I'm ready to share a proposal for the dtype refactor. I iterated several times on the design and now have a good base for discussion. The full draft implementation is in this gist.

Here are the main ideas:

  1. DataType hierarchy that implements portable/abstract dtypes, not tied to a particular lib (pandas, Spark, etc.)

    • Its responsibilities are coercion and checking equality between dtypes.
    • Implemented as dataclasses to reduce boilerplate. We get immutability (required for hashing), hash, equality and repr for free.
  2. Backend class: A dtype factory. It holds a lookup table to convert input values (e.g. abstract pandera DataType, string alias, np.dtype, pandas dtype, etc.) to concrete pandera dtypes (see 3.). Two lookup modes are supported:

    • By type: map a type to a function with signature Callable[[Any], PanderaDtype] that takes an instance of that type and converts it to a pandera dtype, e.g. extract pd.CategoricalDtype runtime arguments into a pandera.pandas.Category instance. This mechanism relies on functools.singledispatch and can therefore match subtypes of registered lookups.

    • By value: direct mapping between an object and a pandera dtype, e.g. pandera.Int32 -> pandera.pandas.Int32, np.int32 -> pandera.pandas.Int32. singledispatch cannot dispatch on a type or a literal, which is why we need the direct mapping.

  3. Concrete PandasDtype hierarchy. Each dtype here can subclass the appropriate abstract DataType (see 1.). We also need to register the dtype conversions with the PandasBackend.register decorator. The decorator registers by value if applied to a class, or registers the conversion function if applied to a function.

from dataclasses import dataclass
from typing import Any, Tuple

import pandas as pd

# DataType, PandasObject, and PandasBackend are defined in the draft implementation (see the gist)

@dataclass(frozen=True)
class Category(DataType):
    categories: Tuple[Any] = None  # immutable sequence to ensure safe hash
    ordered: bool = False

    def __post_init__(self) -> "Category":
        categories = tuple(self.categories) if self.categories is not None else None
        # bypass frozen dataclass
        # see https://docs.python.org/3/library/dataclasses.html#frozen-instances
        object.__setattr__(self, "categories", categories)

@dataclass(frozen=True)
class PandasDtype: # Generic dtype in case the user supplies an unknown dtype.
    native_dtype: Any = None  # get pandas native dtype (useful for strategy module)

    def coerce(self, obj: PandasObject) -> PandasObject:
        return obj.astype(self.native_dtype)

@PandasBackend.register( # conversion for default Category 
    Category, Category(), #  pandera.Category
    pd.CategoricalDtype, pd.CategoricalDtype()
)
@dataclass(frozen=True)
class PandasCategory(PandasDtype, Category):
    def __post_init__(self) -> "PandasDtype":
        super().__post_init__()
        object.__setattr__(
            self, "native_dtype", pd.CategoricalDtype(self.categories, self.ordered)
        )

    # conversion for category instance with non-default arguments
    @PandasBackend.register(Category, pd.CategoricalDtype)
    def _to_pandas_category(cat: pd.CategoricalDtype):
        return PandasCategory(cat.categories, cat.ordered)

assert (
    PandasBackend.dtype(Category) # by value
    == PandasBackend.dtype(Category()) # by value
    == PandasBackend.dtype(pd.CategoricalDtype) # by value
    == PandasBackend.dtype(pd.CategoricalDtype()) # by value
    == PandasCategory()
)

assert (
    PandasBackend.dtype(pd.CategoricalDtype(["a", "b"], ordered=True)) # by type
    == PandasBackend.dtype(Category(["a", "b"], ordered=True)) # by type
    == PandasCategory(["a", "b"], ordered=True)
)

The design avoids endless if-elses because each dtype is self-contained.

Hopefully that's not over-engineered. Let's discuss whether we can simplify it or spot loopholes before moving on to implementing all dtypes and integrating it into pandera.

cosmicBboy commented 3 years ago

great design work @jeffzi ! will read this and the gist over and chew on it for a few days.

ryanhaarmann commented 3 years ago

Hi, I'm highly interested in using pandera, but support for PySpark dataframes/schemas is really needed. I would really like to see this make it into a release. I'm also willing to participate in creating the PySpark schema/types variant.

cosmicBboy commented 3 years ago

@jeffzi the implementation looks good to me overall!

I'm having a hard time grokking PandasCategory._to_pandas_category... how is the first arg not self? I think it would be beneficial to be really explicit (in naming things in this module) about pandera abstract DataTypes, library-specific types (e.g. pd.CategoricalDtype), and instances of those types (e.g. pd.CategoricalDtype(categories=list("abc"), ordered=False)).

I wonder if we can abstract out _to_pandas_category to something like from_dtype_instance and make it something that PandasDtype subclasses can implement to handle the case of pandas dtypes with non-default arguments.

Also have a few questions:

  1. what's the purpose of native_dtype?
  2. can you elaborate on the difference between lookup "by type" and "by value"? The inline comments here are kinda confusing to me, as there are some cases where PandasBackend.dtype(pd.CategoricalDtype()) is "by value" and PandasBackend.dtype(pd.CategoricalDtype(["a", "b"], ordered=True)) is "by type"... aren't both of these cases "by value", since both are instances of the pd.CategoricalDtype type?

cosmicBboy commented 3 years ago

I would really like to see this make it into a release. I'm also willing to participate in creating the PySpark schema/types variant.

hey @ryanhaarmann, thanks that would be awesome!

we'd appreciate your thoughts on this issue, but also a closely related one: https://github.com/pandera-dev/pandera/issues/381. Namely, would it be enough to leverage a library like koalas as a validation backend engine to perform validations on spark dataframes, or would you want access to the pyspark API when e.g. defining custom validation functions?

The benefit of supporting pandas-like API wrappers like koalas or modin is that pandera itself can leverage those libraries to validate at scale and reduce the complexity of supporting alternative APIs. As you can see from the description and initial thoughts in #381, supporting a different validation engine (i.e. non-pandas) will require a fair bit of design/implementation work, but may be worth it in the end

edit: I did some hacking around with koalas and modin and it's quite literally a few-line code change to add support for scaling pandera schemas to large dataframes using these packages. However, #381 might be worth doing anyway to (a) clean up the internal API and (b) support dataframes that don't follow the pandas API.

jeffzi commented 3 years ago

  1. what's the purpose of native_dtype?

It replicates the PandasDtype.numpy_dtype property. The pandas implementation will give back numpy or pandas dtypes, the PySpark one would give back Spark types, etc. Currently, PandasDtype.numpy_dtype is only used for strategies. Generally speaking, it could be useful for specific DataFrameSchema/Check implementations to access the native dtypes.

  2. can you elaborate on the difference between lookup "by type" and "by value"?

I agree the code is confusing. In my mind, we have 2 kinds of inputs we want to accept for generating dtypes.

  1. Values (internally calls Backend._register_lookup()): the name is probably too vague; what I mean by "value" is anything that is not an instantiated object. We just look the values up in a dictionary. Examples:

    • Literals (dtype aliases): currently handled by PandasDtype.from_alias() and from_pandas_api_type(), e.g. "int32".
    • dtype classes: e.g. pd.StringDtype, pd.CategoricalDtype, numpy.int32.
  2. Types (internally calls Backend._register_converter()): relies on functools.singledispatch (= overloading in other OO languages). singledispatch cannot dispatch on the string "int32". Currently, types are handled by PandasDtype.from_python_type and PandasDtype.from_numpy_dtype, and pandas extension types are handled directly in PandasDtype.get_dtype(). A minimal sketch of the singledispatch mechanism follows below.
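
A minimal, self-contained sketch of the "by type" mechanism (the function name to_pandera_dtype is just for illustration, not the draft's actual API):

import functools

import pandas as pd

@functools.singledispatch
def to_pandera_dtype(obj):
    raise TypeError(f"no converter registered for {type(obj)}")

@to_pandera_dtype.register
def _(obj: pd.CategoricalDtype):
    # dispatch happens on the *type* of the instance, so any parametrization works
    return ("Category", tuple(obj.categories), obj.ordered)

print(to_pandera_dtype(pd.CategoricalDtype(["a", "b"], ordered=True)))
#> ('Category', ('a', 'b'), True)

# a string alias like "int32" cannot be dispatched this way, hence the separate
# "by value" lookup dictionary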

Another confusing part is that I wrapped those 2 mechanisms in a single decorator that automatically chooses the registration method. The idea was to hide the complexity. I agree it's too obscure, see end of this post for a solution.

there are some cases where PandasBackend.dtype(pd.CategoricalDtype()) is "by value" and PandasBackend.dtype(pd.CategoricalDtype(["a", "b"], ordered=True)) is "by type"... aren't both of these cases "by value", as in type instances of the pd.CategoricalDtype type?

I said "by value" for pd.CategoricalDtype() because I registered it as a lookup here. Actually they could be both by type since _to_pandas_category() can handle a default category. Even if we can technically register default dtypes by lookup, you are right that we should not since it will confuse readers.

PandasBackend.dtype(pd.CategoricalDtype(["a", "b"], ordered=True)) cannot be registered by value because we need singledispatch to dispatch the instantiated CategoricalDtype to _to_pandas_category(). Registering by value would require registering every combination of parameters, which is impossible for that dtype.

I'm having a hard time grokking PandasCategory. _to_pandas_category... how is the first arg not self

That's because the function is forwarded to singledispatch, which would then register self as the dispatch type.

I wonder if we can abstract out _to_pandas_category to something like from_dtype_instance and make it something that PandasDtype subclasses can implement to handle the case of pandas dtypes with non-default arguments.

Agreed. from_dtype_instance should be a class method since it will act as a factory that builds instances. Ideally we'd want to rely on the type annotations of from_dtype_instance to avoid decorating it just for the sake of listing types.

We can also rename the class decorator to register_dtype. It would only register by lookup; DataType.from_dtype_instance would take care of instantiated native dtypes.

jeffzi commented 3 years ago

Quick update.

Examples:

from typing import Union

import numpy as np
import pandas as pd
import pandera.dtype

# PandasEngine, PandasDtype, and _PandasInt come from the draft pandera.engines.pandas_engine module

@PandasEngine.register_dtype(
    akin=[pandera.dtype.Int64, pandera.dtype.Int64(), "int64", np.int64]
)
class PandasInt64(pandera.dtype.Int64, _PandasInt):
    nullable: bool = False

@PandasEngine.register_dtype(akin=[pandera.dtype.Category, pd.CategoricalDtype])
class PandasCategory(PandasDtype, pandera.dtype.Category):
    def __post_init__(self) -> "PandasDtype":
        super().__post_init__()
        object.__setattr__(
            # _native_dtype is used for coercion in base PandasDtype
            self, "_native_dtype", pd.CategoricalDtype(self.categories, self.ordered) 
        )

    @classmethod
    def from_parametrized_dtype(
        cls, cat: Union[pandera.dtype.Category, pd.CategoricalDtype]
    ):
        return PandasCategory(categories=cat.categories, ordered=cat.ordered)

from pandera.dtype import Category
from pandera.engines.pandas_engine import PandasCategory

assert (
    PandasEngine.dtype(Category)
    == PandasEngine.dtype(pd.CategoricalDtype)
    == PandasEngine.dtype(Category())  # dispatch via from_parametrized_dtype
    == PandasEngine.dtype(pd.CategoricalDtype())  # dispatch via from_parametrized_dtype
    == PandasCategory()
)

Hopefully it's easier to understand, I'm quite happy with how it's turning out.

I did not update the gist (too lazy). Now I need to refactor all the calls to pandera.dtypes.PandasDtype and write some tests... I'm planning to open a PR once I have the core pandera functionality working.

cosmicBboy commented 3 years ago

I know the tests that involve types are kinda all over the place, hopefully it won't be too much of a pain to refactor 😅.

One minor point: I don't have any objective points to back this up, but akin feels a little esoteric to me. Alternatives to consider might be equivalent_dtypes, equivalents, or members (as in members of a set of types).

jeffzi commented 3 years ago

At first I was going for "equivalent_dtypes" but it's very verbose and repeated many times. "equivalents" is perhaps a good middle ground. English isn't my native language, so I trust your judgment :)

jeffzi commented 3 years ago

Hi @cosmicBboy. I'm still working on this, aiming for a PR this weekend. Testing has been (very) time consuming!

cosmicBboy commented 3 years ago

thanks @jeffzi, yeah I'm sure you're uncovering all the random places there are type-related tests in the test suite 😅

jeffzi commented 3 years ago

fixed by #559