unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

What is the best way to handle dates in a SchemaModel? #829

Closed moleary-earnest closed 2 years ago

moleary-earnest commented 2 years ago

Question about pandera

Is there a way to specify a brute-force / naive coercion approach so that a series ends up with vanilla dates (no tz info)?

I prefer to use the class based pa.SchemaModel approach if possible.

The reason I ask is that I have started using this library for validation (and coercion) as my data flows through a pipeline. I am using the following model to ensure that the output at the end has a date column and a float column.

import pandera as pa
from pandera.typing import Series


class ModelOutput(pa.SchemaModel):
    """
    Output model for any lag reduction / forecasting model.
    """

    point_in_time: Series[pa.DateTime] = pa.Field(
        description="Point in time to be used as a key",
    )

    yhat: Series[float] = pa.Field(
        description="Estimated value",
        nullable=True,
    )

However, using ModelOutput(output_data) results in the following error:

SchemaError: expected series 'point_in_time' to have type datetime64[ns], got datetime64[ns, UTC]

Thanks in advance!
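For the "vanilla dates" goal itself, one option that does not depend on pandera is to strip the timezone with pandas before validating. The following is only a minimal sketch: the helper name `strip_timezone` and the sample frame are hypothetical, and it assumes that converting to UTC wall time before dropping the tz info is the desired behaviour.

```python
import pandas as pd


def strip_timezone(series: pd.Series) -> pd.Series:
    """Return a tz-naive version of a datetime series (hypothetical helper).

    Tz-aware values are converted to UTC first, then the tz info is dropped,
    so the result holds UTC wall-clock times.
    """
    if getattr(series.dtype, "tz", None) is None:
        return series  # already tz-naive (or not a tz-aware datetime column)
    return series.dt.tz_convert("UTC").dt.tz_localize(None)


# Illustrative data only: a tz-aware column like the one in the error message.
output_data = pd.DataFrame(
    {
        "point_in_time": pd.to_datetime(["2022-01-01", "2022-01-02"], utc=True),
        "yhat": [1.0, None],
    }
)
output_data["point_in_time"] = strip_timezone(output_data["point_in_time"])
print(output_data.dtypes)
```

After this step the column carries no timezone, so a model that annotates the column as `Series[pa.DateTime]` would be checking against tz-naive data.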

moleary-earnest commented 2 years ago

I have tried the following two approaches; neither works (both result in the same error):

# Annotated
from typing import Annotated

import pandas as pd

UTCTimestamp = Annotated[pd.DatetimeTZDtype, "ns", "UTC"]
class ModelOutput(pa.SchemaModel):
    """
    Output model for any lag reduction / forecasting model.
    """

    point_in_time: Series[UTCTimestamp] = pa.Field(
        description="Point in time to be used as a key",
    )
    yhat: Series[float] = pa.Field(
        description="Estimated value",
        nullable=True,
    )

# dtype_kwargs
class ModelOutput(pa.SchemaModel):
    """
    Output model for any lag reduction / forecasting model.
    """

    point_in_time: Series[pd.DatetimeTZDtype] = pa.Field(
        description="Point in time to be used as a key",
        dtype_kwargs={"unit": "ns", "tz": "UTC"},
    )
    yhat: Series[float] = pa.Field(
        description="Estimated value",
        nullable=True,
    )

SchemaError: expected series 'point_in_time' to have type datetime64[ns], got datetime64[ns, UTC]

moleary-earnest commented 2 years ago

Just to add to the info.

I am very confused because when I print out the schema I get the following:

<Schema DataFrameSchema(
    columns={
        'point_in_time': <Schema Column(name=point_in_time, type=DataType(datetime64[ns, UTC]))>
        'optimized_date': <Schema Column(name=optimized_date, type=DataType(datetime64[ns, UTC]))>
        'yhat': <Schema Column(name=yhat, type=DataType(float64))>
        'yhat_lower': <Schema Column(name=yhat_lower, type=DataType(float64))>
        'yhat_upper': <Schema Column(name=yhat_upper, type=DataType(float64))>
    },
    checks=[],
    coerce=False,
    dtype=None,
    index=None,
    strict=False
    name=None,
    ordered=False
)>

So the schema is expecting a datetime64[ns, UTC] type, right?

cosmicBboy commented 2 years ago

Hi @moleary-earnest, you'll need to specify coerce=True at the column level (or in the Config definition) for pandera to coerce incoming data. Lemme know if that works!

cosmicBboy commented 2 years ago

Converting this into a discussion question. @moleary-earnest please mark my answer as the correct one if it solved your problem!