pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Timestamp 'fold' argument ignored when tz is provided as string/name #55932

Open kohlerjl opened 10 months ago

kohlerjl commented 10 months ago

Reproducible Example

import pandas as pd
import dateutil.tz
import zoneinfo

utc0 = pd.Timestamp('2023-11-05T08:30:00Z')
utc1 = pd.Timestamp('2023-11-05T09:30:00Z')

tz = dateutil.tz.gettz('US/Pacific')
assert pd.Timestamp(year=2023, month=11, day=5, hour=1, minute=30, fold=0, tz=tz) == utc0
assert pd.Timestamp(year=2023, month=11, day=5, hour=1, minute=30, fold=1, tz=tz) == utc1

tz = zoneinfo.ZoneInfo('US/Pacific')
assert pd.Timestamp(year=2023, month=11, day=5, hour=1, minute=30, fold=0, tz=tz) == utc0
assert pd.Timestamp(year=2023, month=11, day=5, hour=1, minute=30, fold=1, tz=tz) == utc1

tz = 'US/Pacific'
assert pd.Timestamp(year=2023, month=11, day=5, hour=1, minute=30, fold=0, tz=tz) == utc0
assert pd.Timestamp(year=2023, month=11, day=5, hour=1, minute=30, fold=1, tz=tz) == utc1

Issue Description

The fold argument to the Timestamp constructor appears to be ignored when tz is provided as a string, but works as expected for the corresponding dateutil.tz or zoneinfo objects.
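For context, here is a minimal standard-library sketch of what fold is meant to select, independent of pandas. It assumes the system time zone database is available to zoneinfo and uses 'America/Los_Angeles', the canonical IANA name that 'US/Pacific' links to. The two results correspond to utc0 and utc1 in the example above.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumption: 'America/Los_Angeles' stands in for 'US/Pacific' (same rules).
tz = ZoneInfo("America/Los_Angeles")

# 01:30 on 2023-11-05 occurs twice: clocks fall back from 02:00 PDT to 01:00 PST.
first = datetime(2023, 11, 5, 1, 30, fold=0, tzinfo=tz)   # first pass, PDT (UTC-7)
second = datetime(2023, 11, 5, 1, 30, fold=1, tzinfo=tz)  # second pass, PST (UTC-8)

first_utc = first.astimezone(timezone.utc)
second_utc = second.astimezone(timezone.utc)

print(first_utc.isoformat())   # 2023-11-05T08:30:00+00:00
print(second_utc.isoformat())  # 2023-11-05T09:30:00+00:00
```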

On the current development branch, I get an AmbiguousTimeError on the last two asserts (the traceback below is from the first of them):

---------------------------------------------------------------------------
AmbiguousTimeError                        Traceback (most recent call last)
Cell In[1], line 17
     14 assert pd.Timestamp(year=2023, month=11, day=5, hour=1, minute=30, fold=1, tz=tz) == utc1
     16 tz = 'US/Pacific'
---> 17 assert pd.Timestamp(year=2023, month=11, day=5, hour=1, minute=30, fold=0, tz=tz) == utc0
     18 assert pd.Timestamp(year=2023, month=11, day=5, hour=1, minute=30, fold=1, tz=tz) == utc1

File timestamps.pyx:1882, in pandas._libs.tslibs.timestamps.Timestamp.__new__()

File conversion.pyx:328, in pandas._libs.tslibs.conversion.convert_to_tsobject()

File conversion.pyx:399, in pandas._libs.tslibs.conversion.convert_datetime_to_tsobject()

File conversion.pyx:658, in pandas._libs.tslibs.conversion._localize_pydatetime()

File ~/venv/lib/python3.11/site-packages/pytz/tzinfo.py:366, in DstTzInfo.localize(self, dt, is_dst)
    360 # If we get this far, we have multiple possible timezones - this
    361 # is an ambiguous case occurring during the end-of-DST transition.
    362 
    363 # If told to be strict, raise an exception since we have an
    364 # ambiguous case
    365 if is_dst is None:
--> 366     raise AmbiguousTimeError(dt)
    368 # Filter out the possiblilities that don't match the requested
    369 # is_dst
    370 filtered_possible_loc_dt = [
    371     p for p in possible_loc_dt if bool(p.tzinfo._dst) == is_dst
    372 ]

AmbiguousTimeError: 2023-11-05 01:30:00

This behavior is at least better than the current release (2.1.2), which fails with an AssertionError because pd.Timestamp(year=2023, month=11, day=5, hour=1, minute=30, fold=0, tz=tz) silently returns the wrong timestamp, Timestamp('2023-11-05 01:30:00-0800', tz='US/Pacific'), instead of the expected one with a -0700 offset.

Expected Behavior

I would expect ambiguous timestamps to be resolved using 'fold' in the same way when the timezone is given as a string (e.g. tz='US/Pacific') as when it is given as the equivalent zoneinfo or dateutil.tz object. I noticed that the 'fold' argument is not permitted when using a pytz timezone, but at least in that case a descriptive error is raised.
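Until string time zones honour fold, one possible workaround is to convert the name to a zoneinfo object before building the timestamp. The sketch below uses a hypothetical helper, resolve_tz, which is not part of pandas; it is shown with the standard-library datetime only, and again assumes 'America/Los_Angeles' as the canonical name behind 'US/Pacific'.

```python
from datetime import datetime, tzinfo
from zoneinfo import ZoneInfo


def resolve_tz(tz):
    """Return a tzinfo object, turning an IANA name into a ZoneInfo.

    Hypothetical helper for illustration; not a pandas API.
    """
    if isinstance(tz, str):
        return ZoneInfo(tz)
    if isinstance(tz, tzinfo):
        return tz
    raise TypeError(
        f"expected a time zone name or tzinfo object, got {type(tz).__name__}"
    )


tz = resolve_tz("America/Los_Angeles")
dt = datetime(2023, 11, 5, 1, 30, fold=1, tzinfo=tz)
print(dt.isoformat())  # 2023-11-05T01:30:00-08:00
```

The object returned by resolve_tz can be passed as tz= to pd.Timestamp in place of the string; according to the asserts in the reproducible example, zoneinfo objects give the expected result for both fold values.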

Installed Versions

INSTALLED VERSIONS
------------------
commit : b2d9ec17c52084ee2b629633c9119c01ea11d387
python : 3.11.5.final.0
python-bits : 64
OS : Linux
OS-release : 6.6.1-arch1-1
Version : #1 SMP PREEMPT_DYNAMIC Wed, 08 Nov 2023 16:05:38 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.utf8
LOCALE : en_US.UTF-8
pandas : 2.2.0.dev0+564.gb2d9ec17c5
numpy : 1.26.2
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.17.2
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

jiffyclub commented 3 months ago

Coming in with another example and traceback from Pandas 2.2.2.

Exception:

In [50]: pd.Timestamp(dt.datetime(2022, 11, 6, 1, 6, 58), tz='America/New_York', fold=0)
---------------------------------------------------------------------------
AmbiguousTimeError                        Traceback (most recent call last)
Cell In[50], line 1
----> 1 pd.Timestamp(dt.datetime(2022, 11, 6, 1, 6, 58), tz='America/New_York', fold=0)

File timestamps.pyx:1865, in pandas._libs.tslibs.timestamps.Timestamp.__new__()

File conversion.pyx:412, in pandas._libs.tslibs.conversion.convert_to_tsobject()

File conversion.pyx:483, in pandas._libs.tslibs.conversion.convert_datetime_to_tsobject()

File conversion.pyx:748, in pandas._libs.tslibs.conversion._localize_pydatetime()

File ~/mambaforge/envs/populus-env/lib/python3.10/site-packages/pytz/tzinfo.py:366, in DstTzInfo.localize(self, dt, is_dst)
    360 # If we get this far, we have multiple possible timezones - this
    361 # is an ambiguous case occurring during the end-of-DST transition.
    362
    363 # If told to be strict, raise an exception since we have an
    364 # ambiguous case
    365 if is_dst is None:
--> 366     raise AmbiguousTimeError(dt)
    368 # Filter out the possiblilities that don't match the requested
    369 # is_dst
    370 filtered_possible_loc_dt = [
    371     p for p in possible_loc_dt if bool(p.tzinfo._dst) == is_dst
    372 ]

AmbiguousTimeError: 2022-11-06 01:06:58

Works:

In [49]: pd.Timestamp(dt.datetime(2022, 11, 6, 1, 6, 58), tz=dateutil.tz.gettz('America/New_York'), fold=0)
Out[49]: Timestamp('2022-11-06 01:06:58-0400', tz='dateutil//usr/share/zoneinfo/America/New_York')

Looks like we may be running into a known issue: https://github.com/pandas-dev/pandas/blob/a5e812d86deb62872f8d514d894a22931fc84217/pandas/_libs/tslibs/conversion.pyx#L747-L748

Thanks @kohlerjl for the workaround!
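For anyone trying to tell whether a given wall time is affected, here is a rough standard-library check for ambiguous times. is_ambiguous is a hypothetical helper, not a pandas or pytz function, and it assumes the system time zone database is available to zoneinfo. It relies on the datetime convention that, inside a fold, fold=0 selects the offset before the transition and fold=1 the offset after it.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def is_ambiguous(naive, tz):
    """Return True if the naive wall time occurs twice in tz.

    Hypothetical helper for illustration only.
    """
    first = naive.replace(tzinfo=tz, fold=0)
    second = naive.replace(tzinfo=tz, fold=1)
    # In a fold the earlier occurrence has the larger UTC offset
    # (e.g. -04:00 before the clocks go back, -05:00 after).
    # In a spring-forward gap the order is reversed, so '>' excludes gaps.
    return first.utcoffset() > second.utcoffset()


ny = ZoneInfo("America/New_York")
print(is_ambiguous(datetime(2022, 11, 6, 1, 6, 58), ny))  # True
print(is_ambiguous(datetime(2022, 11, 6, 3, 0, 0), ny))   # False
```

A check like this could be used to decide when to pass a zoneinfo or dateutil.tz object rather than a string, since only ambiguous wall times trigger the error shown above.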