pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: extend pd.api.types with an is_datelike_dtype #54742

Open JMBurley opened 1 year ago

JMBurley commented 1 year ago

Feature Type

Problem Description

Testing types in pandas is incomplete for datelikes.

It would be useful to maintain a pd.api.types.is_datelike_dtype that can interpret any datelike (e.g. datetime64, pyarrow timestamp, timestamp, dbdate) in the same way that pd.api.types.is_numeric_dtype does for numerics.

Reproducible example:

import pandas as pd

n_rows = 5
df = pd.DataFrame({
    'datetimes': pd.date_range('2015-02-24', periods=n_rows, freq='H'),
    'values': range(n_rows),
})
print(df.dtypes)
is_datelike = pd.api.types.is_datetime64_any_dtype(df['datetimes'])
is_numeric = pd.api.types.is_numeric_dtype(df['values'])
print(f'{is_datelike = }')
print(f'{is_numeric = }\n')

df = df.convert_dtypes(dtype_backend='pyarrow')
print(df.dtypes)
is_datelike = pd.api.types.is_datetime64_any_dtype(df['datetimes'])
is_numeric = pd.api.types.is_numeric_dtype(df['values'])
print(f'{is_datelike = }')
print(f'{is_numeric = }')

gives

datetimes    datetime64[ns]
values                int64
dtype: object
is_datelike = True
is_numeric = True

datetimes    timestamp[ns][pyarrow]
values               int64[pyarrow]
dtype: object
is_datelike = False
is_numeric = True

Feature Description

# PSEUDOCODE
def is_datelike_dtype(s:pd.Series) -> bool:
    if {s is a datelike dtype}:
         return True
    else:
         return False
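
One possible shape for such a helper, as a minimal sketch rather than a proposed implementation. It assumes that pyarrow is installed whenever an Arrow-backed dtype is present, and that the db-dtypes date extension reports its dtype name as "dbdate" (inferred from the comparison used elsewhere in this issue, not checked against that library). The function body is hypothetical and is not existing pandas API.

```python
import pandas as pd


def is_datelike_dtype(arr_or_dtype) -> bool:
    """Hypothetical helper: True for datetime64, tz-aware datetime,
    Arrow timestamp/date and (by name) 'dbdate' dtypes."""
    # Accept a Series/Index/array (anything with .dtype) or a dtype itself.
    dtype = getattr(arr_or_dtype, "dtype", arr_or_dtype)

    # NumPy datetime64 and pandas tz-aware DatetimeTZDtype.
    if pd.api.types.is_datetime64_any_dtype(dtype):
        return True

    # Arrow-backed columns: inspect the wrapped pyarrow type.
    if isinstance(dtype, pd.ArrowDtype):
        import pyarrow as pa  # assumed importable if an ArrowDtype exists

        pa_type = dtype.pyarrow_dtype
        return pa.types.is_timestamp(pa_type) or pa.types.is_date(pa_type)

    # Assumption: the db-dtypes date extension dtype is named "dbdate".
    return getattr(dtype, "name", None) == "dbdate"
```

This only inspects the dtype, so it never parses values and deliberately returns False for object or string columns that merely contain date-formatted text.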

Alternative Solutions

Additional Context

No response

JMBurley commented 1 year ago

I will note that I have an incomplete system that gives roughly the right behaviour, but I hope we can find something faster & more maintainable:

import pandas as pd

# test_series: the pd.Series being checked (assumed to be defined already)
if (
    pd.api.types.is_object_dtype(test_series)
    or pd.api.types.is_datetime64_any_dtype(test_series)
    or test_series.dtype == 'dbdate'
    or test_series.dtype == 'timestamp[ns][pyarrow]'
):
    # Object, dt64, dbdate or pyarrow timestamp columns can contain valid dates.
    # 'dbdate' & 'timestamp[ns][pyarrow]' are not yet pd.api testable (pd 2.0.0)
    # (nb. to_datetime will work on float/int but generates garbage unless the value is epoch time.
    # As I don't have a reliable way to know if numerics are valid dates, ignore them. User should force them to dt)
    try:
        sample_len = min(len(test_series), 50)
        pd.to_datetime(test_series.sample(sample_len))  # check date viability on subset
        is_datelike = True
    except (ValueError, TypeError, OverflowError):
        # ValueError: not dt-like, or pd._libs.tslibs.np_datetime.OutOfBoundsDatetime
        # TypeError: Cell within series cannot be cast to datetime
        # OverflowError: tried to convert something that became larger than a 64 bit int
        is_datelike = False
else:
    is_datelike = False

PS. Also note that the above code treats strings as dates if they are appropriately formatted, which may not be good behaviour: for instance, date methods can't work on a string column. The original intent of the above code was to coerce anything meaningfully datelike to datetime64.
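
To make the string caveat concrete, here is a small sketch using made-up data (the two date strings are illustrative, not taken from a real dataset). It shows that a value-based check can accept a string column even though the column itself does not support datetime methods until it is converted.

```python
import pandas as pd

# Illustrative data: date-formatted strings stored in a non-datetime column.
strings = pd.Series(["2015-02-24", "2015-02-25"])

# to_datetime parses the strings, so a parse-based check would call this "datelike"...
parsed = pd.to_datetime(strings)

# ...but the original column has no usable .dt accessor.
try:
    strings.dt.year
    dt_accessor_works = True
except AttributeError:
    dt_accessor_works = False

print(f"{dt_accessor_works = }")
print(parsed.dt.year.tolist())
```

A dtype-only check would avoid this mismatch, at the cost of no longer flagging columns that could be coerced to datetime64.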