pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.67k stars 17.91k forks source link

BUG: Timestamp fails when fold is passed and building from components as positional args (Pandas 2.0 regression) #52117

Open lafrech opened 1 year ago

lafrech commented 1 year ago

Pandas version checks

Reproducible Example

import datetime as dt
import pandas as pd

pd.Timestamp(2020, 1, 1, 0, 0, tzinfo=dt.timezone.utc, fold=0)

Issue Description

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/_libs/tslibs/timestamps.pyx", line 1584, in pandas._libs.tslibs.timestamps.Timestamp.__new__
ValueError: Cannot pass fold with possibly unambiguous input: int, float, numpy.datetime64, str, or timezone-aware datetime-like. Pass naive datetime-like or build Timestamp from components.

Code works with Pandas 1.5. Possible Pandas 2.0 regression?

Search yielded #44378 except I do build from components.

When passing components as kwargs, it works:

pd.Timestamp(year=2020, month=1, day=1, hour=0, minute=0, tzinfo=dt.timezone.utc, fold=0)
Timestamp('2020-01-01 00:00:00+0000', tz='UTC')

I figured maybe I'm not using the constructor correctly but I couldn't find a mention in the docs preventing this usage.

The only test I found regarding this is this one:

https://github.com/pandas-dev/pandas/blob/ee84ef2753f396b6b9edb514dc2d1a72f78bd1ed/pandas/tests/scalar/timestamp/test_constructors.py#L839-L847

It is this test that led me to the "pass all as kwargs" solution.

Is this a bug or a known limitation?

Expected Behavior

Timestamp('2020-01-01 00:00:00+0000', tz='UTC')

Installed Versions

INSTALLED VERSIONS ------------------ commit : c2a7f1ae753737e589617ebaaff673070036d653 python : 3.9.2.final.0 python-bits : 64 OS : Linux OS-release : 5.10.0-21-amd64 Version : #1 SMP Debian 5.10.162-1 (2023-01-21) machine : x86_64 processor : byteorder : little LC_ALL : None LANG : fr_FR.UTF-8 LOCALE : fr_FR.UTF-8 pandas : 2.0.0rc1 numpy : 1.24.2 pytz : 2022.7.1 dateutil : 2.8.2 setuptools : 65.3.0 pip : 23.0.1 Cython : None pytest : 7.2.1 hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : 2.9.5 jinja2 : None IPython : None pandas_datareader: None bs4 : None bottleneck : None brotli : None fastparquet : None fsspec : None gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : 1.10.1 snappy : None sqlalchemy : 2.0.4 tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : None qtpy : None pyqt5 : None
mroeschke commented 1 year ago

From the original implementation in https://github.com/pandas-dev/pandas/commit/6e04264250f27787075dc0cb1685e20e7a6af071, I think the error message could have been clarified as components means from keyword only arguments

lafrech commented 1 year ago

Perhaps. This could be a won'tfix, then. Still it is a bit surprising.

# 1/ This works, following calls are equivalent AFAIU
Timestamp(2019, 10, 27, 1, 30, tz=tz)
Timestamp(year=2019, month=10, day=27, hour=1, minute=30, tz=tz)

# 2/ This works
Timestamp(year=2019, month=10, day=27, hour=1, minute=30, tz=tz, fold=fold)

# 3/ This does not work
Timestamp(2019, 10, 27, 1, 30, tz=tz, fold=fold)

Why? Pandas understands the positional args, so how come 3/ is not equivalent to 2/? I may lack time/skills to dig into the code and understand why, but it is not intuitive to me.

Besides, it worked with Pandas 1.5.

In any case, the error message is misleading. And probably the docs too.

mroeschke commented 1 year ago

@AlexKirko do you happen to remember when adding fold, were other Timestamp components supposed to be added as keyword only arguments when specifying fold?

I'm trying to determine if originally in https://github.com/pandas-dev/pandas/pull/31563 if 3/ was never supposed to work even in 1.5

AlexKirko commented 1 year ago

Hello, this is what I can remember or could dig up:

In the examples that we documented then, we instruct the user to build from a naive datetime or keywords:

   pd.Timestamp(datetime.datetime(2019, 10, 27, 1, 30, 0, 0),
                tz='dateutil/Europe/London', fold=0)
   pd.Timestamp(year=2019, month=10, day=27, hour=1, minute=30,
                tz='dateutil/Europe/London', fold=1)

We never say that the user can't build from positional arguments.

Now, the code responsible for the check is here:

        if fold is not None:
            if fold not in [0, 1]:
                raise ValueError(
                    "Valid values for the fold argument are None, 0, or 1."
                )

            if (ts_input is not _no_input and not (
                    PyDateTime_Check(ts_input) and
                    getattr(ts_input, "tzinfo", None) is None)):
                raise ValueError(
                    "Cannot pass fold with possibly unambiguous input: int, "
                    "float, numpy.datetime64, str, or timezone-aware "
                    "datetime-like. Pass naive datetime-like or build "
                    "Timestamp from components."
                )

From what I remember and see in the code, we intended to raise an exception here only when we are trying to build from a non-ambiguous datetime-like input: one that passes PyDateTime_Check and has tzinfo (or does not pass PyDateTime_Check, I suppose).

I would say, the behavior in 2.0 is a bug. If we support the positional argument convention to mimic datetime, then passing the equivalent of a naive datetime through positional arguments should behave the same as a naive datetime.

My guess would be that something in 2.0 makes us pass a ts_input that is not _no_input in @lafrech 's example? Could try to chase this down during the weekend. @mroeschke , what do you think?

lafrech commented 1 year ago

If we support the positional argument convention to mimic datetime, then passing the equivalent of a naive datetime through positional arguments should behave the same as a naive datetime.

Spot on. That would be consistent, indeed.

Could try to chase this down during the weekend.

Thanks @AlexKirko.

AlexKirko commented 1 year ago

Hm, on 1.5.3 fold is actually None before this check (on main it's 0):

        # Allow fold only for unambiguous input
        if fold is not None:
            if fold not in [0, 1]:
                raise ValueError(
                    "Valid values for the fold argument are None, 0, or 1."
                )

Also, on 1.5.3 ts_input is a datetime.datetime object, and on main it's an int (just has the year in it). We then try to get a timezone from an int, and this is where we error out.

I'll go through the constructor code and see what changed.

AlexKirko commented 1 year ago

Somehow we used to go into this if in this situation:

        if tzinfo is not None:
            # GH#17690 tzinfo must be a datetime.tzinfo object, ensured
            #  by the cython annotation.
            if tz is not None:
                if (is_integer_object(tz)
                    and is_integer_object(ts_input)
                    and is_integer_object(freq)
                ):
                    # GH#31929 e.g. Timestamp(2019, 3, 4, 5, 6, tzinfo=foo)
                    # TODO(GH#45307): this will still be fragile to
                    #  mixed-and-matched positional/keyword arguments
                    ts_input = datetime(
                        ts_input,
                        freq,
                        tz,
                        unit or 0,
                        year or 0,
                        month or 0,
                        day or 0,
                        fold=fold or 0,
                    )
                    nanosecond = hour
                    tz = tzinfo
                    print("auxilary constructor")
                    print(ts_input)
                    return cls(ts_input, nanosecond=nanosecond, tz=tz)

                raise ValueError('Can provide at most one of tz, tzinfo')

Now we no longer do, which looks correct, but where before we would pass a timezone-aware datetime without the fold argument, we now keep going and pass positional arguments.

AlexKirko commented 1 year ago

Looks like the culprit is the predicate. We currently have:

            if (ts_input is not _no_input and not (
                    PyDateTime_Check(ts_input) and
                    getattr(ts_input, "tzinfo", None) is None)):

Which is equivalent to:

            if (ts_input is not _no_input and (
                    not PyDateTime_Check(ts_input) or
                    getattr(ts_input, "tzinfo", None) is not None)):

I think we should have:

            if (ts_input is not _no_input and
                    PyDateTime_Check(ts_input) and
                    getattr(ts_input, "tzinfo", None) is not None):

This fixes the OP's problem, let me see is this passes the test suite.

AlexKirko commented 1 year ago

Of course not. Anyway, I'll find a predicate variant that does what we want and passes the tests.

AlexKirko commented 1 year ago

Also, this looks like a kludge to me:

        elif is_integer_object(year):
            # User passed positional arguments:
            # Timestamp(year, month, day[, hour[, minute[, second[,
            # microsecond[, tzinfo]]]]])
            ts_input = datetime(ts_input, year, month, day or 0,
                                hour or 0, minute or 0, second or 0, fold=fold or 0)
            unit = None

In fact, ts_input is the year (which is what the datetime constructor expects).

print(ts_input)
print(_date_attributes)
> 2020
> [1, 1, 0, 0, None, None, None, None]

This should probably be carefully fixed in a separate PR, if it's not just me finding weird that month is assigned to the variable named year.