andycarter85 opened 2 years ago
Hi @andycarter85. Thanks for the reproducible example.
Pyarrow supports date types (date32 and date64) but pandas does not: pandas only supports datetimes via numpy.datetime64. However, pandas does allow wrapping any Python object in the numpy object dtype, which is how pyarrow can translate its own date types to pandas.
If I expand your example:
df2.info()
#> <class 'pandas.core.frame.DataFrame'>
#> RangeIndex: 1 entries, 0 to 0
#> Data columns (total 1 columns):
#>  #   Column  Non-Null Count  Dtype
#> ---  ------  --------------  -----
#>  0   mydate  1 non-null      object
#> dtypes: object(1)
#> memory usage: 136.0+ bytes
# looking at the first element of the "date32" column
print(f"{type(df2.iloc[0,0])=}")
#> type(df2.iloc[0,0])=<class 'datetime.date'>
You see that pd.read_parquet (pyarrow under the hood) has translated date32 to Python's standard datetime.date.
We recently added support for logical data types, a mechanism to cover extra data types not officially supported by pandas. For example, we added a Decimal data type, which pyarrow supports but is also boxed in an object column. Logical data types should be part of the next release.
@cosmicBboy @andycarter85 tl;dr: I can add support for a Date logical type to extend the coverage of pyarrow types.
> I can add support for a Date logical type to extend the coverage of pyarrow types.
Yes, that would be awesome!
@andycarter85 does using the object type work for you as a temporary workaround?
I'm not very familiar with pandera at the moment. Is there a way I can adapt my infer_schema call temporarily until support for Date types is introduced?
@andycarter85 how are you using infer_schema in your workflow?
I am just getting started with pandera tbh; we have some large pre-existing datasets that I wanted to try inferring a yaml schema for, and then iterate from there, rather than building a schema from scratch.
As it looks like the issue has been resolved in #887, I'm happy to wait for the next release and try again then.
Not sure if the pa.date32 type is important in your use case, but a workaround here would be to convert all the columns containing dates into pandas-supported datetime64 before calling infer_schema.
import pandas as pd
import pyarrow as pa
from io import BytesIO
import pandera

# build a dataframe with a single date column
df = pd.DataFrame([pd.Timestamp.now().date()], columns=['mydate'])
pqtypes = {
    'mydate': pa.date32(),
}

# write it to parquet with an explicit date32 schema
buffer = BytesIO()
df.to_parquet(
    buffer,
    engine='pyarrow',
    schema=pa.schema([pa.field(x, y) for x, y in pqtypes.items()])
)
buffer.seek(0)

# read it back, coercing the date columns to datetime64[ns]
# before inferring the schema
df2 = pd.read_parquet(buffer).astype({k: "datetime64[ns]" for k in pqtypes})
schema = pandera.infer_schema(df2)
print(schema.to_script())
Another thing to do would be to register a custom dtype (see https://pandera.readthedocs.io/en/stable/dtypes.html#example), but it would inherit from pandas_engine.DateTime. The coerce method would then handle the conversion of datetime.date objects into pandas-supported datetime64[ns].
Describe the bug
Hoping that pandera can handle date32 types, but this appears to raise an error.
Expected behavior
Hoping basic date32 types can be handled along with timestamps.