Closed moleary-earnest closed 2 years ago
I have tried the following two approaches; neither works (both result in the same error):
from typing import Annotated

import pandas as pd
import pandera as pa
from pandera.typing import Series

# Annotated
UTCTimestamp = Annotated[pd.DatetimeTZDtype, "ns", "UTC"]

class ModelOutput(pa.SchemaModel):
    """
    Output model for any lag reduction / forecasting model.
    """
    point_in_time: Series[UTCTimestamp] = pa.Field(
        description="Point in time to be used as a key",
    )
    yhat: Series[float] = pa.Field(
        description="Estimated value",
        nullable=True,
    )

# dtype_kwargs
class ModelOutput(pa.SchemaModel):
    """
    Output model for any lag reduction / forecasting model.
    """
    point_in_time: Series[pd.DatetimeTZDtype] = pa.Field(
        description="Point in time to be used as a key",
        dtype_kwargs={"unit": "ns", "tz": "UTC"},
    )
    yhat: Series[float] = pa.Field(
        description="Estimated value",
        nullable=True,
    )
SchemaError: expected series 'point_in_time' to have type datetime64[ns], got datetime64[ns, UTC]
Just to add to the info.
I am very confused because when I print out the schema I get the following:
<Schema DataFrameSchema(
    columns={
        'point_in_time': <Schema Column(name=point_in_time, type=DataType(datetime64[ns, UTC]))>
        'optimized_date': <Schema Column(name=optimized_date, type=DataType(datetime64[ns, UTC]))>
        'yhat': <Schema Column(name=yhat, type=DataType(float64))>
        'yhat_lower': <Schema Column(name=yhat_lower, type=DataType(float64))>
        'yhat_upper': <Schema Column(name=yhat_upper, type=DataType(float64))>
    },
    checks=[],
    coerce=False,
    dtype=None,
    index=None,
    strict=False
    name=None,
    ordered=False
)>
So the schema is expecting a datetime64[ns, UTC] type, right?
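For anyone comparing the two dtype strings in the error, the distinction can be reproduced with pandas alone, without pandera: a tz-naive column and a tz-aware column carry different dtypes. A minimal sketch, with made-up sample values and illustrative variable names:

```python
import pandas as pd

# Hypothetical sample data; any timestamps would do.
naive = pd.Series(pd.to_datetime(["2023-01-01 00:00", "2023-01-02 12:30"]))

# Attach a timezone without shifting the clock time.
aware = naive.dt.tz_localize("UTC")

# A tz-naive column has no timezone and a plain datetime64 dtype.
print(naive.dtype, naive.dt.tz)

# A tz-aware column uses pandas' DatetimeTZDtype extension dtype.
print(aware.dtype, aware.dt.tz)

# Checking the dtype class is one way to tell the two apart.
naive_is_tz_aware = isinstance(naive.dtype, pd.DatetimeTZDtype)
aware_is_tz_aware = isinstance(aware.dtype, pd.DatetimeTZDtype)
print(naive_is_tz_aware, aware_is_tz_aware)
```

Since these are distinct dtypes, a strict dtype check with coerce=False will report a mismatch whenever the data and the schema disagree on timezone awareness.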
Hi @moleary-earnest, you'll need to specify coerce=True at the column level (or in the Config definition) for pandera to coerce incoming data. Lemme know if that works!
Converting this into a discussion question. @moleary-earnest please mark my answer as the correct one if it solved your problem!
Question about pandera
Is there a way to specify a brute force / naive coercion approach for a series to end up with vanilla dates (no tz info)?
I prefer to use the class-based pa.SchemaModel approach if possible.
The reason I ask is that I have been starting to use this library for validation (and coercion) as my data flows through a pipeline, and I am using the following model to ensure the output at the end is a date column and a float column.
However, using ModelOutput(output_data) results in the following error:
Thanks in advance!
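On the "vanilla dates" part of the question: independent of pandera, pandas itself has two ways to drop timezone information from a tz-aware column, and they differ in which clock time is kept. A small sketch of both follows. The name output_data mirrors the question, but its contents here are made up for illustration, and whether this preprocessing step fits a given pipeline is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the pipeline output in the question.
output_data = pd.DataFrame(
    {
        "point_in_time": pd.to_datetime(
            ["2023-01-01 05:00", "2023-01-02 17:30"]
        ).tz_localize("America/New_York"),
        "yhat": [1.5, None],
    }
)

col = output_data["point_in_time"]

# Option A: convert to UTC first, then drop the timezone.
as_utc_naive = col.dt.tz_convert(None)

# Option B: drop the timezone but keep the local wall-clock time.
as_local_naive = col.dt.tz_localize(None)

print(as_utc_naive.iloc[0])    # 2023-01-01 10:00:00
print(as_local_naive.iloc[0])  # 2023-01-01 05:00:00
```

Either result is tz-naive, so it could be assigned back to the column before validating against a model that expects plain datetimes. Option A is usually the safer choice when rows may come from different timezones, since every value ends up on the same UTC clock.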