Closed BakerAugust closed 3 years ago
I noticed this issue today as well. I'm not sure whether from_str_alias supports datetimes with timezones. I get the same error from an infer_schema call on a dataset whose UTC timestamp column has the string alias 'datetime64[ns, UTC]'.
hey @BakerAugust thanks for filing this issue! looking into it now
I have a similar issue (if not the same):
from pandera import Column, DataFrameSchema
import pandas as pd
s = DataFrameSchema(
{
'timestamp': Column(dtype=pd.DatetimeTZDtype(tz="UTC"))
})
print(s.example())
Returns TypeError: Invalid datetime unit in metadata string "[ns, utc]"
I could reproduce @th0ger's error with version v0.7.0.
Starting in v0.7.0, pandas.DatetimeTZDtype is recognized by the pandera DataType hierarchy. The problem is that strategies rely on the hypothesis library, which requires numpy dtypes. Numpy cannot cast a pandas.DatetimeTZDtype to a regular datetime64. In fact, numpy does not support timezone-aware datetimes at all.
import numpy as np
import pandas as pd
from pandera.engines.pandas_engine import Engine
pandera_dtype = Engine.dtype(pd.DatetimeTZDtype(tz="UTC"))
alias = str(pandera_dtype)
print(alias)
#> datetime64[ns, UTC]
np.dtype(alias)
#> Traceback (most recent call last):
#> ----> 1 np.dtype(alias)
#> TypeError: Invalid datetime unit in metadata string "[ns, UTC]"
It should be possible to work around that limitation by downcasting to naive datetime64 in pandera.strategies.column/index_strategy and casting to DatetimeTZDtype when the series is generated.
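For illustration, here is a minimal pandas-only sketch of that idea. It is not the pandera implementation: the helper name to_aware is hypothetical, and it assumes the naive values can be read as UTC instants before being converted to the target timezone.

```python
import numpy as np
import pandas as pd


def to_aware(naive_values: np.ndarray, tz: str) -> pd.Series:
    """Hypothetical helper: wrap naive datetime64[ns] values in a tz-aware Series."""
    # numpy can only hold naive datetimes, so the data starts as datetime64[ns].
    naive = pd.Series(naive_values.astype("datetime64[ns]"))
    # Read the naive values as UTC instants, then convert to the target zone.
    # Going through UTC avoids nonexistent or ambiguous local times around DST changes.
    return naive.dt.tz_localize("UTC").dt.tz_convert(tz)


naive_values = np.array(
    ["2021-01-01T00:00", "2021-06-01T12:30"], dtype="datetime64[ns]"
)
aware = to_aware(naive_values, "UTC")
print(aware.dtype)
```

The generation step only ever sees a dtype numpy understands, and the timezone is attached afterwards on the pandas side.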
@cosmicBboy I can give it a shot.
Awesome, thanks @jeffzi!
A bit off topic: I'm leveling up on pandera, hypothesis, and property-based testing (PBT) in general. This is great. I also assume that contributors to this project are test geeks, taking your own PBT medicine 😀.
So, v0.7.0 added support for pandas.DatetimeTZDtype, and it turned out to work for validation but not for strategies. I'm curious why the tests didn't catch this bug before release?
v0.7.0 introduced a complete rework of the internal dtype representation. Strategies are a separate feature, tested independently.
Concretely, tests are applied to a list of pandera.dtypes.DataType that can be translated to numpy equivalents. That list only contains dtypes initialized with the default constructor. Unfortunately pandera.engines.pandas_engine.DateTime (the internal representation for pandas.DatetimeTZDtype) is not timezone-aware by default.
https://github.com/pandera-dev/pandera/blob/abc817feffc736acd8978cde7e7cf50c3cad2983/tests/strategies/test_strategies.py#L30-L40
Just adding pandas_engine.Engine.dtype("datetime64[ns, UTC]") to SUPPORTED_DTYPES is enough to add it to the current strategies test suite. I'm also writing specialized tests for that particular dtype because it is treated slightly differently than other dtypes in order to play nicely with hypothesis.
I should be able to push a PR in the coming days.
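To illustrate the coverage gap without depending on pandera, here is a small sketch that checks which string aliases numpy can parse. The helper numpy_can_parse and the alias list are made up for this example and are not taken from the pandera test suite.

```python
import numpy as np


def numpy_can_parse(alias: str) -> bool:
    """Hypothetical helper: report whether numpy accepts a dtype string alias."""
    try:
        np.dtype(alias)
    except TypeError:
        return False
    return True


# The first three aliases have numpy equivalents; the timezone-aware one does not.
aliases = ["int64", "float32", "datetime64[ns]", "datetime64[ns, UTC]"]
results = {alias: numpy_can_parse(alias) for alias in aliases}
for alias, ok in results.items():
    print(f"{alias}: {ok}")
```

A list built only from default-constructed dtypes never contains the last kind of alias, which is why the strategies tests passed.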
@th0ger @cantos688 @BakerAugust The fix for this bug (#595) has been merged!
The provided test case is fixed (pandera 0.7.1). Tested it for non-UTC as well:
from pandera import Column, DataFrameSchema
import pandas as pd
dtype = pd.DatetimeTZDtype(tz="US/Central")
print(dtype)
s = DataFrameSchema(
{
'timestamp': Column(dtype=dtype)
})
print(s.example())
datetime64[ns, US/Central]
timestamp
0 2187-08-13 05:05:14.571410014-06:00
1 1969-12-31 18:00:00.000064212-06:00
2 2167-12-08 07:29:01.542986145-06:00
Worth mentioning: if we use dtype = pd.DatetimeTZDtype(tz='dateutil/US/Central'), taken from the pandas.DatetimeTZDtype documentation, I get TypeError: Invalid datetime unit in metadata string "[ns, tzfile('/usr/share/zoneinfo/US/Central')]"
@th0ger Thanks for testing, and reporting the dateutil case.
Historically pandera relied on string aliases to coerce dtypes, but str(pd.DatetimeTZDtype(tz='dateutil/US/Central')) is not recognized by pandas:
import pandas as pd
data = pd.Series(["2021-01-01 00:00"])
# ok
data.astype(pd.DatetimeTZDtype(tz="dateutil/US/Central"))
#> 0 2021-01-01 00:00:00-06:00
#> dtype: datetime64[ns, tzfile('/usr/share/zoneinfo/US/Central')]
# fail
pandas_dtype = str(pd.DatetimeTZDtype(tz="dateutil/US/Central"))
try:
    data.astype(pandas_dtype)
except TypeError:
    print(pandas_dtype)
#> datetime64[ns, tzfile('/usr/share/zoneinfo/US/Central')]
I've pushed a fix for strategies (#620) to avoid using string aliases and verified that it fixed your case.
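As a rough pandas-only sketch of the "no string alias" approach (an illustration, not the code from #620), the timezone object can be taken directly from the dtype instead of round-tripping through str(dtype). It assumes dateutil can resolve US/Central on the machine running it.

```python
import numpy as np
import pandas as pd

dtype = pd.DatetimeTZDtype(tz="dateutil/US/Central")

# Start from naive datetime64[ns] data, which is all numpy can represent.
naive = pd.Series(np.array(["2021-01-01T00:00"], dtype="datetime64[ns]"))

# Pass the tz object held by the dtype; no string alias is involved.
aware = naive.dt.tz_localize(dtype.tz)
print(aware.dtype == dtype)
```

Because the dtype object is passed around as-is, it does not matter that its string form is something pandas cannot parse back.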
@jeffzi leveled up for fast response: +1
Question about pandera
Hi there, thanks for the package. Love what you are doing with this project. I'm posing this as a question because I am unsure if this is a bug, an implementation error on my end, or expected behavior.
I am trying to use the parameterized dtype pd.DatetimeTZDtype in a SchemaModel. When using the resulting SchemaModel to validate other dataframes it works as expected, but when using that SchemaModel as a strategy, I am seeing the following error:
Here's some test code