andycarter85 opened 2 years ago
Hi @andycarter85. Thanks for the reproducible example.
Pyarrow supports date types (date32 and date64) but pandas does not: pandas only supports datetimes via numpy.datetime64. However, pandas does allow wrapping any Python object in the numpy object dtype, which is how pyarrow can translate its own date types to pandas.
If I expand your example:
df2.info()
#> <class 'pandas.core.frame.DataFrame'>
#> RangeIndex: 1 entries, 0 to 0
#> Data columns (total 1 columns):
#>  #   Column  Non-Null Count  Dtype
#> ---  ------  --------------  -----
#>  0   mydate  1 non-null      object
#> dtypes: object(1)
#> memory usage: 136.0+ bytes
# looking at the first element of the "date32" column
print(f"{type(df2.iloc[0,0])=}")
#> type(df2.iloc[0,0])=<class 'datetime.date'>
You see that pd.read_parquet (pyarrow under the hood) has translated date32 to Python's standard datetime.date.
We recently added support for logical data types, a mechanism to cover extra data types not officially supported by pandas. For example, we added a Decimal data type, which pyarrow supports but is also boxed in an object column. Logical data types should be part of the next release.
@cosmicBboy @andycarter85 tl;dr: I can add support for a Date logical type to extend the coverage of pyarrow types.
> I can add support for a Date logical type to extend the coverage of pyarrow types.
Yes, that would be awesome!
@andycarter85 does using the object type work for you as a temporary workaround?
I'm not very familiar with pandera at the moment. Is there a way I can adapt my infer_schema call temporarily until support for Date types is introduced?
@andycarter85 how are you using infer_schema in your workflow?
I am just getting started with pandera tbh; we have some large pre-existing datasets that I wanted to try inferring a yaml schema for, and then iterate from there, rather than building a schema from scratch.
As it looks like the issue has been resolved in #887, I'm happy to wait for the next release and try again then.
Not sure if the pa.date32 type is important in your use case, but a workaround here would be to convert all the columns containing dates into pandas-supported datetime64 before calling infer_schema.
import pandas as pd
import pyarrow as pa
from io import BytesIO
import pandera

# build a dataframe with a single date column
df = pd.DataFrame([pd.Timestamp.now().date()], columns=['mydate'])
pqtypes = {
    'mydate': pa.date32(),
}

# write it to parquet with an explicit date32 schema
buffer = BytesIO()
df.to_parquet(
    buffer,
    engine='pyarrow',
    schema=pa.schema([pa.field(x, y) for x, y in pqtypes.items()])
)
buffer.seek(0)

# read it back, coercing the date columns to datetime64[ns]
# before inferring the schema
df2 = pd.read_parquet(buffer).astype({k: "datetime64[ns]" for k in pqtypes})
schema = pandera.infer_schema(df2)
print(schema.to_script())
Another thing to do would be to register a custom dtype (see https://pandera.readthedocs.io/en/stable/dtypes.html#example), but it would inherit from pandas_engine.DateTime. The coerce method would then handle the conversion of datetime.date objects into pandas-supported datetime64[ns].
Describe the bug
Hoping that pandera can handle date32 types, but this appears to raise an error.
Expected behavior
Hoping basic date32 types can be handled along with timestamps.