unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Parameterized dtype DatetimeTZDtype failing when used as a strategy? #534

Closed BakerAugust closed 3 years ago

BakerAugust commented 3 years ago

Question about pandera

Hi there, thanks for the package. Love what you are doing with this project. I'm posing this as a question because I am unsure whether this is a bug, an implementation error on my end, or expected behavior.

I am trying to use the parameterized dtype pd.DatetimeTZDtype in a SchemaModel. When using the resulting SchemaModel to validate other dataframes it works as expected, but when using that SchemaModel as a strategy, I am seeing the following error:

===================================== FAILURES ======================================
_______________________________ test_simple_strategy ________________________________

    @given(df=Schema.to_schema().strategy())
>   def test_simple_strategy(df):

src/hobo/tests/test_pandera.py:15: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../.pyenv/versions/tvg-ds/lib/python3.9/site-packages/pandera/strategies.py:1002: in _dataframe_strategy
    columns=[
../../.pyenv/versions/tvg-ds/lib/python3.9/site-packages/pandera/strategies.py:1003: in <listcomp>
    column.strategy_component()
../../.pyenv/versions/tvg-ds/lib/python3.9/site-packages/pandera/strategies.py:153: in _wrapper
    return fn(*args, **kwargs)
../../.pyenv/versions/tvg-ds/lib/python3.9/site-packages/pandera/schema_components.py:277: in strategy_component
    self.pdtype,
../../.pyenv/versions/tvg-ds/lib/python3.9/site-packages/pandera/schemas.py:1620: in pdtype
    return PandasDtype.from_str_alias(self.dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = <enum 'PandasDtype'>, str_alias = 'datetime64[ns, EST]'

    @classmethod
    def from_str_alias(cls, str_alias: str) -> "PandasDtype":
        """Get PandasDtype from string alias.

        :param: pandas dtype string alias from
            https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dtypes
        :returns: pandas dtype
        """
        pandas_dtype = {
            "bool": cls.Bool,
            "datetime64[ns]": cls.DateTime,
            "timedelta64[ns]": cls.Timedelta,
            "category": cls.Category,
            "float": cls.Float,
            "float16": cls.Float16,
            "float32": cls.Float32,
            "float64": cls.Float64,
            "int": cls.Int,
            "int8": cls.Int8,
            "int16": cls.Int16,
            "int32": cls.Int32,
            "int64": cls.Int64,
            "uint8": cls.UInt8,
            "uint16": cls.UInt16,
            "uint32": cls.UInt32,
            "uint64": cls.UInt64,
            "Int8": cls.INT8,
            "Int16": cls.INT16,
            "Int32": cls.INT32,
            "Int64": cls.INT64,
            "UInt8": cls.UINT8,
            "UInt16": cls.UINT16,
            "UInt32": cls.UINT32,
            "UInt64": cls.UINT64,
            "object": cls.Object,
            "complex": cls.Complex,
            "complex64": cls.Complex64,
            "complex128": cls.Complex128,
            "complex256": cls.Complex256,
            "str": cls.String,
            "string": cls.String if LEGACY_PANDAS else cls.STRING,
        }.get(str_alias)

        if pandas_dtype is None:
>           raise TypeError(
                f"pandas dtype string alias '{str_alias}' not recognized"
            )
E           TypeError: pandas dtype string alias 'datetime64[ns, EST]' not recognized

../../.pyenv/versions/tvg-ds/lib/python3.9/site-packages/pandera/dtypes.py:207: TypeError

Here's some test code

# python 3.9.0

# hypothesis==6.14.0
# pandas==1.2.4
# pandera==0.6.4

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
from hypothesis import given

# Test example from docs: https://pandera.readthedocs.io/en/latest/schema_models.html#field
class Schema(pa.SchemaModel):
    col1: Series[pd.DatetimeTZDtype] = pa.Field(
        dtype_kwargs={"unit": "ns", "tz": "EST"}
    )

# This works as expected
def test_validate_with_schema():
    df = DataFrame()
    Schema.validate(df)

# This produces the error
@given(df=Schema.to_schema().strategy())
def test_simple_strategy(df):
    print(df.head())

cantos688 commented 3 years ago

I noticed this issue today as well. I'm not sure whether from_str_alias supports timezone-aware datetimes. I get the same error when calling infer_schema on a dataset whose UTC timestamp column has the string alias 'datetime64[ns, UTC]'.
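The lookup in from_str_alias is an exact dictionary match, so a parameterized alias such as 'datetime64[ns, EST]' can never hit a key. A minimal sketch of how such an alias could be split into its unit and timezone parts, assuming a hypothetical helper named parse_tz_alias (it is not part of pandera, and this is not how the library resolved the issue):

```python
import re
from typing import Optional, Tuple

# Hypothetical helper, not part of pandera: split a timezone-aware alias
# such as "datetime64[ns, EST]" into its unit and timezone parts.
_TZ_ALIAS = re.compile(r"^datetime64\[(?P<unit>[a-z]+),\s*(?P<tz>.+)\]$")


def parse_tz_alias(alias: str) -> Optional[Tuple[str, str]]:
    """Return (unit, tz) for a tz-aware alias, or None for any other alias."""
    match = _TZ_ALIAS.match(alias)
    if match is None:
        return None
    return match.group("unit"), match.group("tz")


print(parse_tz_alias("datetime64[ns, EST]"))  # ('ns', 'EST')
print(parse_tz_alias("datetime64[ns]"))       # None
```

A naive alias returns None here, so a caller could fall back to the plain dictionary lookup in that case.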

cosmicBboy commented 3 years ago

hey @BakerAugust thanks for filing this issue! looking into it now

th0ger commented 3 years ago

I have a similar issue (if not the same):

from pandera import Column, DataFrameSchema
import pandas as pd
s = DataFrameSchema(
    {
        'timestamp': Column(dtype=pd.DatetimeTZDtype(tz="UTC"))
    })
print(s.example())

Returns TypeError: Invalid datetime unit in metadata string "[ns, utc]"

jeffzi commented 3 years ago

I could reproduce @th0ger's error with version v0.7.0.

Starting in v0.7.0, pandas.DatetimeTZDtype is recognized by the pandera DataType hierarchy. Now, the problem is that strategies leverage the hypothesis library which requires numpy dtypes. Numpy cannot cast a pandas.DatetimeTZDtype to a regular datetime64. In fact, numpy does not support timezone-aware datetime at all.

import numpy as np
import pandas as pd
from pandera.engines.pandas_engine import Engine

pandera_dtype = Engine.dtype(pd.DatetimeTZDtype(tz="UTC"))
alias = str(pandera_dtype)
print(alias)
#> datetime64[ns, UTC]
np.dtype(alias)
#> Traceback (most recent call last):
#> ----> 1 np.dtype(alias)
#> TypeError: Invalid datetime unit in metadata string "[ns, UTC]"

It should be possible to work around that limitation by downcasting to naive datetime64 in pandera.strategies.column/index_strategy and casting to DatetimeTZDtype when the series is generated.
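A rough sketch of that idea using only numpy and pandas, not the actual pandera change: a fixed numpy array stands in for the naive values hypothesis would generate, and the timezone is attached afterwards. Treating the naive values as wall times in the target timezone is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for values a hypothesis strategy would generate: plain numpy
# datetime64[ns], which carries no timezone information.
naive_values = np.array(
    ["2021-01-01T00:00:00", "2021-06-01T12:30:00"], dtype="datetime64[ns]"
)
naive = pd.Series(naive_values)

# Attach the timezone only after generation. The naive values are
# interpreted as wall times in the target timezone (an assumption).
target = pd.DatetimeTZDtype(tz="UTC")
aware = naive.dt.tz_localize(target.tz)

print(naive.dtype)
print(aware.dtype)
```

Note that tz_localize is used rather than astype, since recent pandas versions refuse to convert a naive series to a timezone-aware dtype through astype.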

@cosmicBboy I can give it a shot.

cosmicBboy commented 3 years ago

Awesome, thanks @jeffzi!

th0ger commented 3 years ago

A bit off topic: I'm leveling up on pandera, hypothesis, and property-based testing (PBT) in general. This is great. I also assume that contributors to this project are test geeks, taking your own PBT medicine 😀. So, v0.7.0 added support for pandas.DatetimeTZDtype, and it turned out to work for validation but not for strategies. So I'm curious why the tests didn't catch this bug before release?

jeffzi commented 3 years ago

v0.7.0 introduced a complete rework of the internal dtype representation. Strategies are a separate feature, tested independently.

Concretely, tests are applied to a list of pandera.dtypes.DataType that can be translated to numpy equivalents. That list only contains dtypes initialized with the default constructor. Unfortunately pandera.engines.pandas_engine.DateTime (internal representation for pandas.DatetimeTZDtype) is not timezone-aware by default. https://github.com/pandera-dev/pandera/blob/abc817feffc736acd8978cde7e7cf50c3cad2983/tests/strategies/test_strategies.py#L30-L40

Just adding pandas_engine.Engine.dtype("datetime64[ns, UTC]") to SUPPORTED_DTYPES is enough to add it to the current strategies test suite. I'm also writing specialized tests for that particular dtype because it is treated slightly differently than other dtypes in order to play nicely with hypothesis.

I should be able to push a PR in the coming days.

jeffzi commented 3 years ago

@th0ger @cantos688 @BakerAugust The fix for this bug (#595) has been merged!

th0ger commented 3 years ago

The provided test case is fixed (pandera 0.7.1). Tested it for non-UTC as well:

from pandera import Column, DataFrameSchema
import pandas as pd

dtype = pd.DatetimeTZDtype(tz="US/Central")
print(dtype)

s = DataFrameSchema(
    {
        'timestamp': Column(dtype=dtype)
    })
print(s.example())

datetime64[ns, US/Central]
                            timestamp
0 2187-08-13 05:05:14.571410014-06:00
1 1969-12-31 18:00:00.000064212-06:00
2 2167-12-08 07:29:01.542986145-06:00

Worth mentioning: if we use dtype = pd.DatetimeTZDtype(tz='dateutil/US/Central'), taken from the pandas.DatetimeTZDtype documentation, I get TypeError: Invalid datetime unit in metadata string "[ns, tzfile('/usr/share/zoneinfo/US/Central')]"

jeffzi commented 3 years ago

@th0ger Thanks for testing, and reporting the dateutil case.

Historically pandera relied on string aliases to coerce dtypes but str(pd.DatetimeTZDtype(tz='dateutil/US/Central')) is not recognized by pandas:

import pandas as pd

data = pd.Series(["2021-01-01 00:00"])
# ok
data.astype(pd.DatetimeTZDtype(tz="dateutil/US/Central"))
#> 0   2021-01-01 00:00:00-06:00
#> dtype: datetime64[ns, tzfile('/usr/share/zoneinfo/US/Central')]

# fail
pandas_dtype = str(pd.DatetimeTZDtype(tz="dateutil/US/Central"))
try:
    data.astype(pandas_dtype)
except TypeError:
    print(pandas_dtype)
#> datetime64[ns, tzfile('/usr/share/zoneinfo/US/Central')]

I've pushed a fix for strategies (#620) to avoid using string aliases and verified that it fixed your case.

th0ger commented 3 years ago

@jeffzi leveled up for fast response: +1