unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Feat: Adding more pyarrow types to pandas engine #1676

Open aaravind100 opened 2 weeks ago

aaravind100 commented 2 weeks ago

Is your feature request related to a problem? Please describe.

I'd like to keep adding the remaining pyarrow types to the pandas engine. Apart from the existing types, I've come across two more so far.

Describe the solution you'd like

Extend pandas_engine with ArrowList and ArrowStruct types.

I have a working prototype here and can raise a PR.

Additional context

Would you like to add or prioritize some other types from here?

cosmicBboy commented 6 days ago

Hi @aaravind100, the prototype looks good. Can you make a PR? It will just need some unit tests.

Would you like to add or prioritize some other types from here?

I'll leave that to you and others in the community to prioritize :) Which ones are left that are currently unsupported?

aaravind100 commented 6 days ago

@cosmicBboy I've created PR #1699.

I'll leave that to you and others in the community to prioritize :) Which ones are left that are currently unsupported?

These types are compatible with pandas but have not been added yet. I'll try adding some next week.

MarcSkovMadsen commented 3 days ago

+1. Came looking for date64.

Workaround

The following seems to work as a workaround for me for now.

import pandas as pd
import pandera as pa
import pyarrow

# Import only the pandera pieces needed to register a custom dtype;
# pandas and pyarrow are imported directly above.
from pandera.engines.pandas_engine import DataType, Engine, dtypes, immutable

@Engine.register_dtype(
    equivalents=[
        "date64[pyarrow]",
        pyarrow.date64,
        pd.ArrowDtype(pyarrow.date64()),
    ]
)
@immutable
class ArrowDate64(DataType, dtypes.Date):
    """Semantic representation of a :class:`pyarrow.date64`."""

    type = pd.ArrowDtype(pyarrow.date64())
    bit_width: int = 64

class DFSchema(pa.DataFrameModel):
    """Schema for a dataframe of jobs from the endpoint

    https://algodon.de-prod.dk/api/hadrian/joblist/{environment}
    """

    model: str = pa.Field()
    notationtime: ArrowDate64 = pa.Field()
    value: int = pa.Field()

df = pd.DataFrame({
    "model": ["A", "B", "A", "B"],
    "notationtime": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "value": [1, 2, 3, 4],
})
# Parse the strings as datetimes, then cast to the pyarrow-backed date64 dtype
df["notationtime"] = pd.to_datetime(df["notationtime"]).astype("date64[pyarrow]")

# Calling the model validates the dataframe (equivalent to DFSchema.validate(df))
DFSchema(df)