I'd like to suggest another use case for a refactor of pandera dtypes.
I use pandera to validate pandas DataFrames that are ultimately written as parquet files via pyarrow. Parquet supports date and decimal types, which are not natively supported by pandas but can be stored in object columns. I wanted to let pandera coerce to those types during validation. So far, my solution has been to subclass pandera.Column and override coerce_dtype.
Example for date:
from typing import Optional
import pandas as pd
import pandera as pa
class DateColumn(pa.Column):
    """Column containing date values encoded as date types (not datetime)."""

    def __init__(
        self,
        checks: pa.schemas.CheckList = None,
        nullable: bool = False,
        allow_duplicates: bool = True,
        coerce: bool = False,
        required: bool = True,
        name: Optional[str] = None,
        regex: bool = False,
    ) -> None:
        super().__init__(
            pa.Object,  # <===========
            checks=checks,
            nullable=nullable,
            allow_duplicates=allow_duplicates,
            coerce=coerce,
            required=required,
            name=name,
            regex=regex,
        )

    def coerce_dtype(self, series: pd.Series) -> pd.Series:
        """Coerce a pandas.Series to date types."""
        try:
            dttms = pd.to_datetime(series, infer_datetime_format=True, utc=True)
        except TypeError as err:
            msg = f"Error while coercing '{self.name}' to type 'date'"
            raise TypeError(msg) from err
        return dttms.dt.date
schema = pa.DataFrameSchema(columns={"dt": DateColumn(coerce=True)})
df = pd.DataFrame({'dt':pd.date_range("2021-01-01", periods=1, freq="H")})
print(df)
#> dt
#> 0 2021-01-01
print(df["dt"].dtype)
#> datetime64[ns]
df = schema.validate(df)
print(df)
#> dt
#> 0 2021-01-01
print(df["dt"].dtype)
#> object
Created on 2021-01-07 by the reprexpy package
The issue with the above is that custom columns are not compatible with SchemaModel.
I suggest moving coerce_dtype() to a Dtype class that could be subclassed to add arguments to __init__ and/or modify coercion. That would help decouple the coercion logic from DataFrameSchema, e.g.:
https://github.com/pandera-dev/pandera/blob/bfdb118504f78f7dcf6dcba96f6200194b4f5fbf/pandera/schemas.py#L285
It would also make it easier to support dtypes that require extra arguments with SchemaModel (e.g. Decimal needs precision and scale).
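Roughly, something along these lines, reusing the date coercion from above (all names here are placeholders, not an actual API proposal):

class Dtype:
    def coerce_dtype(self, series: pd.Series) -> pd.Series:
        """Default coercion; subclasses can override or add __init__ arguments."""
        raise NotImplementedError

class Date(Dtype):
    def coerce_dtype(self, series: pd.Series) -> pd.Series:
        # same coercion as the DateColumn example above
        return pd.to_datetime(series, infer_datetime_format=True, utc=True).dt.date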
Cool, thanks for describing this use case!

I suggest moving coerce_dtype() to a Dtype class that could be subclassed to add arguments to __init__ and/or modify coercion.
👍 to sketch out some ideas for the dtype class:
import pandas as pd
from enum import Enum

# abstract class spec, should support types for other dataframe-like data structures
# e.g. spark dataframes, dask, ray, vaex, xarray, etc.
class DataType:
    def __call__(self, obj):  # obj should be an arbitrary object
        """Coerces object to the dtype."""
        raise NotImplementedError

    def __eq__(self, other):
        # some default equality implementation
        pass

    def __hash__(self):
        # some default hash implementation
        pass

class PandasDataType(DataType):
    def __init__(self, str_alias):
        self.str_alias = str_alias

    def __call__(self, obj):
        # obj should be a pandas DataFrame, Series, or Index
        return obj.astype(self.str_alias)

# re-implementation of dtypes.PandasDtype, which is currently an Enum class,
# preserving the PandasDtype class for backwards compatibility
class PandasDtype:
    # See the pandas dtypes docs for more information:
    # https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes
    Bool = PandasDataType("bool")
    DateTime = PandasDataType("datetime64[ns]")
    Timedelta = PandasDataType("timedelta64[ns]")
    # etc. for the rest of the data-types that don't need additional arguments

    # use static methods for datatypes with additional arguments
    @staticmethod
    def DatetimeTZ(tz="utc"):
        return PandasDataType(f"datetime64[ns, <{tz}>]")

    @staticmethod
    def Period(freq):
        pass

    @staticmethod
    def Interval(numpy_dtype=None, tz=None, freq=None):
        pass

    @staticmethod
    def Categorical(categories=None, ordered=False):
        pass
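For example, coercion with these objects would just be a call (hypothetical usage of the sketch above):

s = pd.Series(["1", "2", "3"])
coerced = PandasDataType("int64")(s)  # equivalent to s.astype("int64")
assert str(coerced.dtype) == "int64"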
preserving the PandasDtype class for backwards compatibility

The point would be to minimize the impact on the implementation, right?
The Enum is not mentioned in the documentation (besides the API section). All examples exclusively use aliases such as pa.Int, etc. I don't see a situation where the end user would need to deal with PandasDtype other than through the aliases that are exposed for convenience (and possibly unified across dataframe types if pandera is expanded to Spark, etc.).
This is what I have in mind:
class PandasDataType(DataType):  # can be renamed to PandasDtype if it facilitates implementation
    ...

Bool = PandasDataType("bool")
...  # other straightforward dtypes

class DatetimeTZ(PandasDataType):
    def __init__(self, tz="utc"):
        super().__init__(f"datetime64[ns, <{tz}>]")
        self.tz = tz  # in case we need it for other methods

class Datetime(PandasDataType):
    # args forwarded to pd.to_datetime, used for better coercion (if coerce=True)
    def __init__(self, dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit=None, ...):
        ...

    def __call__(self, obj):
        return pd.to_datetime(obj, dayfirst=self.dayfirst, ...)
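A column definition could then look roughly like this (hypothetical usage of the classes sketched above, not an existing API):

schema = pa.DataFrameSchema(
    columns={
        "created_at": pa.Column(Datetime(utc=True), coerce=True),
        "updated_at": pa.Column(DatetimeTZ(tz="CET"), coerce=True),
    }
)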
I'm all for simplification. I think I didn't take into account the fact that most users probably never use PandasDtype directly :)
Another place the PandasDtype enum shows up in the documentation is here: https://pandera.readthedocs.io/en/stable/schema_inference.html#write-to-a-python-script, so part of the solution in the issue here would be to change e.g. PandasDtype.Int to pa.Int in the io.to_script function. I don't think changing this will have a big impact on the end user, since writing a schema out as a python script is mainly meant as an initial code template for when a user infers a schema.
Getting away from the PandasDtype enum would be a good thing, and adding documentation on how to implement custom DataTypes and coercion logic would be a move in the right direction.
hey @jeffzi, just wanted to bring your attention to the visions package:
https://dylan-profiler.github.io/visions/visions/getting_started/usage/types.html
It's written by the same people who wrote pandas-profiling, and it might be worth a look to see if it fits our needs here. If not, it might nevertheless be a good source of inspiration/ideas.
Thanks @cosmicBboy, I did not know about visions.
Before presenting my evaluation of visions, let me restate the goals of the dtype refactor:
1. Support dataframe libraries other than pandas.
2. Support parametrized dtypes, e.g. Category with its ordered and categories arguments.
3. Let users customize coercion logic (the use case that started this issue).
Now, regarding visions:
+:
- Specialized types like IPAddress overlap with pandera Checks but are useful to reduce boilerplate. We could implement them as specialized column types, but we'd need a new syntax for the SchemaModel instead of just annotating with specialized dtypes.
- Typeset is an interesting concept that should be useful in the context of specialized schemas, e.g. JsonSchema, MLSchema.

-:
- In regards to 1., visions only supports pandas at the moment, and pandas is tied to core modules (e.g. graph traversal). Object, which is a numpy/pandas concept, is also a core dtype.
- Goals 2 and 3 are not use cases supported by visions.
- Focus on "logical types", no built-in support for "physical types" like int8, float16, etc.
- During dtype inference, the series could be cast multiple times while traversing the graph. This is much less efficient than pandas' implementation of infer_dtype().
- visions requires networkx.
I think visions's network of dtypes is promising, especially for dtype inference. Unfortunately, I don't see major advantages over a simple class hierarchy.
I would keep the following ideas:
- Specialized dtypes for IP addresses, emails, URLs, etc. They will play nicely with SchemaModel. Pydantic has a similar concept. It (probably) shows that users find them useful.
- Design a mechanism to restrict allowed dtypes. Maybe an extra argument in DataFrameSchema.__init__(allowed_dtypes)?

Just adding my thoughts here @jeffzi for the record
Unfortunately, I don't see major advantages over a simple class hierarchy.
Agreed! Let's go with our own class hierarchy, something like what we've discussed in https://github.com/pandera-dev/pandera/issues/369#issuecomment-757514810
Specialized dtypes for IP addresses, emails, URLs, etc. They will play nicely with SchemaModel. Pydantic has a similar concept. It (probably) shows that users find them useful.
Would love this, we can tackle these once the major refactor for existing dtypes is done. (would also love a Text and Image data type, basically pointers to local or remote files for ML modeling use cases)
Design a mechanism to restrict allowed dtypes. Maybe an extra argument in DataFrameSchema.__init__(allowed_dtypes)?
+1 to this idea, we can also cross that bridge when we get there. Another thought I had was having it as a class attribute:
class DataFrameSchema():
    allowed_dtypes = [...]
Let me know if you need any help with discussing approach/architecture!
I'm ready to share a proposal for the dtype refactor. I iterated several times on the design and now have a good base for discussion. The full draft implementation is in this gist.
Here are the main ideas:
1. A DataType hierarchy that implements portable/abstract dtypes, not tied to a particular lib (pandas, Spark, etc.).
2. A Backend class: a dtype factory. It holds a lookup table to convert input values (e.g. abstract pandera DataType, string alias, np.dtype, pandas dtype, etc.) to concrete pandera dtypes (see 3.). Two lookup modes are supported:
   - By type: Map a type to a function with signature Callable[[Any], Pandera Dtype] that takes an instance of the type and converts it to a dtype, e.g. extract pd.CategoricalDtype runtime arguments into a pandera.pandas.Category instance. This mechanism relies on functools.singledispatch and can therefore match subtypes of registered lookups (see the standalone singledispatch sketch after the example below).
   - By value: Direct mapping between an object and a pandera dtype, e.g. pandera.Int32 -> pandera.pandas.Int32, np.int32 -> pandera.pandas.Int32. singledispatch cannot dispatch on a type or literal, that's why we need the direct mapping.
3. A concrete PandasDtype hierarchy. Each dtype here can subclass the appropriate abstract DataType (1.).

We also need to register the dtype conversions with the decorator PandasBackend.register. The decorator registers by value if applied to a class, or registers the conversion function if applied to a function.
@dataclass(frozen=True)
class Category(DataType):
    categories: Tuple[Any] = None  # immutable sequence to ensure safe hash
    ordered: bool = False

    def __post_init__(self) -> "Category":
        categories = tuple(self.categories) if self.categories is not None else None
        # bypass frozen dataclass
        # see https://docs.python.org/3/library/dataclasses.html#frozen-instances
        object.__setattr__(self, "categories", categories)

@dataclass(frozen=True)
class PandasDtype:  # Generic dtype in case the user supplies an unknown dtype.
    native_dtype: Any = None  # get pandas native dtype (useful for strategy module)

    def coerce(self, obj: PandasObject) -> PandasObject:
        return obj.astype(self.native_dtype)

@PandasBackend.register(  # conversion for default Category
    Category, Category(),  # pandera.Category
    pd.CategoricalDtype, pd.CategoricalDtype()
)
@dataclass(frozen=True)
class PandasCategory(PandasDtype, Category):
    def __post_init__(self) -> "PandasDtype":
        super().__post_init__()
        object.__setattr__(
            self, "native_dtype", pd.CategoricalDtype(self.categories, self.ordered)
        )

# conversion for category instance with non-default arguments
@PandasBackend.register(Category, pd.CategoricalDtype)
def _to_pandas_category(cat: pd.CategoricalDtype):
    return PandasCategory(cat.categories, cat.ordered)

assert (
    PandasBackend.dtype(Category)  # by value
    == PandasBackend.dtype(Category())  # by value
    == PandasBackend.dtype(pd.CategoricalDtype)  # by value
    == PandasBackend.dtype(pd.CategoricalDtype())  # by value
    == PandasCategory()
)

assert (
    PandasBackend.dtype(pd.CategoricalDtype(["a", "b"], ordered=True))  # by type
    == PandasBackend.dtype(Category(["a", "b"], ordered=True))  # by type
    == PandasCategory(["a", "b"], ordered=True)
)
The design avoids endless if-elses because each dtype is self-contained.
Hopefully that's not over-engineered. Let's discuss whether we can simplify or identify loopholes before we move on to implementing all dtypes and integrating it into pandera.
great design work @jeffzi ! will read this and the gist over and chew on it for a few days.
Hi, I'm highly interested in using pandera, but support for PySpark dataframes/schemas is really needed. I would really like to see this make it into a release. I'm also willing to participate in creating the PySpark schema/types variant.
@jeffzi the implementation looks good to me overall!
I'm having a hard time grokking PandasCategory._to_pandas_category... how is the first arg not self? I think it would be beneficial to be really explicit (in naming things in this module) about pandera abstract DataTypes, library-specific types (e.g. pandas' pd.CategoricalDtype), and instances of library data types (e.g. pd.CategoricalDtype(categories=list("abc"), ordered=False)).
I wonder if we can abstract out _to_pandas_category to something like from_dtype_instance and make it something that PandasDtype subclasses can implement to handle the case of pandas dtypes with non-default arguments.
Also have a few questions:
- what's the purpose of native_dtype?
- can you elaborate on the difference between lookup "by type" and "by value"? There are some cases where PandasBackend.dtype(pd.CategoricalDtype()) is "by value" and PandasBackend.dtype(pd.CategoricalDtype(["a", "b"], ordered=True)) is "by type"... aren't both of these cases "by value", as in both are instances of the pd.CategoricalDtype type?

I would really like to see this make it into a release. I'm also willing to participate in creating the PySpark schema/types variant.
hey @ryanhaarmann, thanks that would be awesome!
we'd appreciate your thoughts on this issue, but also a closely related one: https://github.com/pandera-dev/pandera/issues/381. Namely, would it be enough to leverage a library like koalas as a validation backend engine to perform validations on spark dataframes, or would you want access to the pyspark API when e.g. defining custom validation functions?
The benefit of supporting pandas-like API wrappers like koalas or modin is that pandera itself can leverage those libraries to validate at scale and reduce the complexity of supporting alternative APIs. As you can see from the description and initial thoughts in #381, supporting a different validation engine (i.e. non-pandas) will require a fair bit of design/implementation work, but may be worth it in the end
edit: I did some hacking around with koalas and modin and it's quite literally a few-line code change to add support for scaling pandera schemas to large dataframes using these packages. However, #381 might be worth doing anyway to (a) clean up the internal API and (b) support dataframes that don't follow the pandas API.
- what's the purpose of native_dtype?
It replicates the property PandasDtype.numpy_dtype. The pandas implementation will give back numpy or pandas dtypes, PySpark would give Spark types, etc. Currently, PandasDtype.numpy_dtype is only used for strategies. Generally speaking, it could be useful for specific DataFrameSchema/Check implementations to access the native dtypes.
- can you elaborate on the difference between lookup "by type" and "by value"?
I agree the code is confusing. In my mind, we have 2 kinds of inputs we want to accept for generating dtypes.
- Values (internally calls Backend._register_lookup()): The name is probably too vague; what I mean by "value" is anything that is not an instantiated object. We just look the values up in a lookup dictionary. Examples:
  - PandasDtype.from_alias(), from_pandas_api_type(), e.g. "int32"
  - pd.StringDtype, pd.CategoricalDtype, numpy.int32
- Types (internally calls Backend._register_converter()): Relies on functools.singledispatch (= overloading in other OO languages). singledispatch cannot dispatch on the string "int32". Currently, types are handled by PandasDtype.from_python_type, PandasDtype.from_numpy_dtype, and pandas extension types are handled directly in PandasDtype.get_dtype().
Another confusing part is that I wrapped those 2 mechanisms in a single decorator that automatically chooses the registration method. The idea was to hide the complexity. I agree it's too obscure, see end of this post for a solution.
there are some cases where PandasBackend.dtype(pd.CategoricalDtype()) is "by value" and PandasBackend.dtype(pd.CategoricalDtype(["a", "b"], ordered=True)) is "by type"... aren't both of these cases "by value", as in type instances of the pd.CategoricalDtype type?
I said "by value" for pd.CategoricalDtype()
because I registered it as a lookup here. Actually they could be both by type since _to_pandas_category() can handle a default category. Even if we can technically register default dtypes by lookup, you are right that we should not since it will confuse readers.
PandasBackend.dtype(pd.CategoricalDtype(["a", "b"], ordered=True))
cannot be registered by value because we need singledispatch
to dispatch the instantiated CategoricalDtype to _to_pandas_category()
. By value would require to register all the combinations of parameters, which is impossible for that dtype.
I'm having a hard time grokking PandasCategory._to_pandas_category... how is the first arg not self?

That's because the function is forwarded to singledispatch, which would then register "self" as the dispatch type.
I wonder if we can abstract out _to_pandas_category to something like from_dtype_instance and make it something that PandasDtype subclasses can implement to handle the case of pandas dtypes with non-default arguments.
Agreed. from_dtype_instance should be a class method since it will act as a factory that builds instances. Ideally we'd want to rely on the type annotations of from_dtype_instance to avoid decorating it just for the sake of listing types.
We can also rename the class decorator to register_dtype. It would only register by lookup; DataType.from_dtype_instance would take care of instantiated native dtypes.
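Something like this shape, where the engine could inspect the annotation of the first parameter to know which native types to dispatch here (purely a sketch, names may change):

@classmethod
def from_dtype_instance(cls, dtype: pd.CategoricalDtype) -> "PandasCategory":
    # the annotation tells the registration machinery what to dispatch on
    return cls(categories=dtype.categories, ordered=dtype.ordered)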
Quick update.
- Renamed Backend to Engine 🎉
- A single decorator Engine.register_dtype(akin: List[Dtype]). It does 2 things:
  - akin lists equivalent dtypes that can be directly mapped to the default constructor.
  - It looks for from_parametrized_dtype, an optional class method, and registers it. This method is used to convert from parametrized dtype instances. Dispatch is based on the first argument.
examples:
@PandasEngine.register_dtype(
    akin=[pandera.dtype.Int64, pandera.dtype.Int64(), "int64", np.int64]
)
class PandasInt64(pandera.dtype.Int64, _PandasInt):
    nullable: bool = False

@PandasEngine.register_dtype(akin=[pandera.dtype.Category, pd.CategoricalDtype])
class PandasCategory(PandasDtype, pandera.dtype.Category):
    def __post_init__(self) -> "PandasDtype":
        super().__post_init__()
        object.__setattr__(
            # _native_dtype is used for coercion in base PandasDtype
            self, "_native_dtype", pd.CategoricalDtype(self.categories, self.ordered)
        )

    @classmethod
    def from_parametrized_dtype(
        cls, cat: Union[pandera.dtype.Category, pd.CategoricalDtype]
    ):
        return PandasCategory(categories=cat.categories, ordered=cat.ordered)

from pandera.dtype import Category
from pandera.engines.pandas_engine import PandasCategory

assert (
    PandasEngine.dtype(Category)
    == PandasEngine.dtype(pd.CategoricalDtype)
    == PandasEngine.dtype(Category())  # dispatch via from_parametrized_dtype
    == PandasEngine.dtype(pd.CategoricalDtype())  # dispatch via from_parametrized_dtype
    == PandasCategory()
)
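Putting it together, usage would look roughly like this (wiring is still hypothetical until the PR lands):

dtype = PandasEngine.dtype("int64")            # -> PandasInt64(), via the akin lookup
coerced = dtype.coerce(pd.Series(["1", "2"]))  # astype via the underlying native dtype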
Hopefully it's easier to understand; I'm quite happy with how it's turning out.
I did not update the gist (too lazy). Now I need to refactor all the calls to pandera.dtypes.PandasDtype and write some tests...
I'm planning to open a PR once I have the core pandera functionality working.
I know tests that involve testing types are kinda all over the place, hopefully it won't be too much of a pain to refactor 😅.
One minor point: I don't have any objective points to back this up, but akin feels a little esoteric to me. Alternatives to consider might be equivalent_dtypes, equivalents, or members (as in members of a set of types).
At first I was going for "equivalent_dtypes" but it's very verbose and repeated many times. "equivalents" is perhaps a good middle-ground. English isn't my native language so I trust your judgment :)
Hi @cosmicBboy. I'm still working on this, aiming for a PR this weekend. Testing has been (very) time consuming!
thanks @jeffzi, yeah I'm sure you're uncovering all the random places there are type-related tests in the test suite 😅
fixed by #559
Is your feature request related to a problem? Please describe.
Currently, pandera's type system is strongly coupled to the pandas type system. This works well in pandera's current state since it only supports pandas dataframe validation. However, in order to obtain broader coverage of dataframe-like data structures in the python ecosystem, I think it makes sense to slowly move towards this goal by abstracting pandera's type system so that it's not so strongly coupled with pandas' type system.
The PandasDtype enum class needs to be made more flexible such that it supports types with dynamic definitions like CategoricalDtype and PeriodDtype (see the short illustration below).
Describe the solution you'd like
TBD
Describe alternatives you've considered
TBD
Additional context
TBD
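For reference, a minimal illustration of pandas dtypes that carry runtime parameters, which a fixed enum cannot represent:

import pandas as pd

pd.CategoricalDtype(categories=["a", "b"], ordered=True)
pd.PeriodDtype(freq="M")
pd.DatetimeTZDtype(tz="UTC")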