unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Coercing DateTimeIndex fails starting with Pandera v0.13.4 #1046

Closed davidandreoletti closed 1 year ago

davidandreoletti commented 1 year ago

Describe the bug

When the following DatetimeIndex is passed to _to_datetime, the inline function inside DateTime._coerce(...),

col = DatetimeIndex(['2021-10-24 00:00:00+00:00', '2021-10-25 00:00:00+00:00',
               '2021-10-26 00:00:00+00:00', '... 00:00:00+00:00', '2021-10-31 00:00:00+00:00'],
              dtype='datetime64[ns, UTC]', name='timestamp', freq=None)

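with a schema instance built from

import pandas as pd
import pandera
from pandera import Field
from pandera.typing import Index, Series

class PriceDFSchema(pandera.SchemaModel):
    # tz-aware index that Pandera is asked to coerce to datetime64[ns, UTC]
    timestamp: Index[pd.DatetimeTZDtype] = Field(
        coerce=True, dtype_kwargs={"unit": "ns", "tz": "UTC"}
    )
    product_id: Series[pd.Int64Dtype] = Field(coerce=True)
    ...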

and, starting with #972, Pandera fails to coerce the <class 'pandas.core.indexes.datetimes.DatetimeIndex'> data_container into type datetime64[ns, UTC], because a DatetimeIndex has no dt attribute and the code accesses it here:

https://github.com/unionai-oss/pandera/blob/5aa77957615720741dd817b668154e9a72cf6c73/pandera/engines/pandas_engine.py#L816

Pandera/Pandas/Bug Status matrix:

Pandera 0.12.0 / Pandas 1.4.4-1.5.2 => coercion success
Pandera 0.13.4 / Pandas 1.4.4-1.5.2 => coercion failure

The difference seems to be that #972 handled coercion of a Series, while in the code sample above the object being coerced is an Index.
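
To make the Series/Index difference concrete, here is a minimal sketch that uses pandas only (no Pandera). The timestamps and variable names are illustrative assumptions, not values from the failing schema. It shows that a datetime Series reaches tz_localize through the .dt accessor, while a DatetimeIndex has no .dt and exposes tz_localize directly.

```python
import pandas as pd

# Arbitrary tz-naive timestamps, chosen only for illustration.
stamps = pd.to_datetime(["2021-10-24", "2021-10-25", "2021-10-26"])

as_index = pd.DatetimeIndex(stamps, name="timestamp")
as_series = pd.Series(stamps, name="timestamp")

# A datetime Series exposes its datetime methods through the .dt accessor.
series_has_dt = hasattr(as_series, "dt")
# A DatetimeIndex has no .dt accessor; the same methods live on the index itself.
index_has_dt = hasattr(as_index, "dt")

# Localizing therefore needs a different call for each container type.
localized_series = as_series.dt.tz_localize("UTC")
localized_index = as_index.tz_localize("UTC")

print(series_has_dt, index_has_dt)
```

Calling as_index.dt.tz_localize("UTC") instead would raise AttributeError, which matches the failure described above.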

Workaround

Use Pandera 0.12.0

Code Sample, a copy-pastable example

See above.

Expected behavior

Coercing succeeds

davidandreoletti commented 1 year ago

The difference seems to be that the coercion happened on a series on https://github.com/unionai-oss/pandera/pull/972 while in the code sample above it is on an index.

@cristianmatache As the author of #972, and in light of the text above, do you have any idea why an Index (unlike a Series) seems to have no 'dt' attribute?

cristianmatache commented 1 year ago

@davidandreoletti It looks like I haven't handled that case properly: only Series have .dt, indexes don't.

We should add an if statement that checks hasattr(col, 'dt'). I think something like this should do:

            if (
                hasattr(pandas_dtype, "tz")
                and pandas_dtype.tz is not None
            ):
                # localize datetime column so that it's timezone-aware
                if hasattr(col, "dt"):  # Series: datetime API lives on .dt
                    if col.dt.tz is None:
                        col = col.dt.tz_localize(pandas_dtype.tz, **_tz_localize_kwargs)
                elif col.tz is None:  # Index: datetime API is on the object itself
                    col = col.tz_localize(pandas_dtype.tz, **_tz_localize_kwargs)
            return col.astype(pandas_dtype)
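
For anyone who wants to try the idea outside Pandera, the branch above can be sketched as a standalone function. This is only an illustration under stated assumptions: localize_if_naive is a hypothetical helper name, not part of Pandera, and the sample data is made up. The Series check is nested so that an already tz-aware Series never falls through to the Index branch, where col.tz would not exist.

```python
import pandas as pd


def localize_if_naive(col, tz, **tz_localize_kwargs):
    """Hypothetical helper (not Pandera API): make `col` tz-aware if it is tz-naive."""
    if hasattr(col, "dt"):  # Series: datetime methods live on the .dt accessor
        if col.dt.tz is None:
            col = col.dt.tz_localize(tz, **tz_localize_kwargs)
    elif col.tz is None:  # DatetimeIndex: datetime methods live on the index itself
        col = col.tz_localize(tz, **tz_localize_kwargs)
    return col


# Target dtype taken from the schema in the report; the data below is illustrative.
target = pd.DatetimeTZDtype(unit="ns", tz="UTC")

naive_index = pd.DatetimeIndex(["2021-10-24", "2021-10-25"], name="timestamp")
aware_index = pd.DatetimeIndex(["2021-10-24", "2021-10-25"], tz="UTC")
naive_series = pd.Series(pd.to_datetime(["2021-10-24", "2021-10-25"]))
aware_series = naive_series.dt.tz_localize("UTC")

coerced_index = localize_if_naive(naive_index, target.tz).astype(target)
coerced_aware_index = localize_if_naive(aware_index, target.tz).astype(target)
coerced_series = localize_if_naive(naive_series, target.tz).astype(target)
coerced_aware_series = localize_if_naive(aware_series, target.tz).astype(target)
```

All four inputs end up with the target dtype. The tz-aware ones skip tz_localize and go straight to astype.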

cristianmatache commented 1 year ago

I can have a look at this early next week. If you need the functionality before that, please feel free to submit a PR.

davidandreoletti commented 1 year ago

I can have a look at this early next week, if you need the functionality before that, please feel free to submit a PR.

Thanks, @cristianmatache, for PR #1057!

JungeAlexander commented 1 year ago

Resurfacing this because the issue still seems to persist and PR #1057 remains unmerged. @cristianmatache, could you please sign your commit as explained here?