pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.81k stars 17.99k forks source link

MultiIndex.from_product converts datetime.date to pd.Timestamp #28152

Open MaximeWeyl opened 5 years ago

MaximeWeyl commented 5 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd

a = pd.Timestamp("2019-2-2").date()
i = pd.MultiIndex.from_product([
    [a, a],
    [2, 3]
])

print("a={} ({}".format(a, type(a)))
print(i[0])

Output is :

a=2019-02-02 (<class 'datetime.date'>
(Timestamp('2019-02-02 00:00:00'), 2)

Problem description

When using from_product with python datetimes, the resulting MultiIndex level is converted to pandas datetimes (Timestamps). There is no way to keep the original python datetime which I want. I got the same behavior with from_tuples. But it was not the case with from_arrays.

Expected Output

I expect the output to respect the type I gave to from_product :

a=2019-02-02 (<class 'datetime.date'>
(datetime.date(2019, 2, 2), 2)

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : None python : 3.7.3.final.0 python-bits : 64 OS : Windows OS-release : 10 machine : AMD64 processor : Intel64 Family 6 Model 142 Stepping 9, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : None.None pandas : 0.25.1 numpy : 1.17.0 pytz : 2019.2 dateutil : 2.8.0 pip : 19.0.3 setuptools : 40.8.0 Cython : None pytest : 4.6.5 hypothesis : None sphinx : 2.2.0 blosc : None feather : None xlsxwriter : None lxml.etree : 4.4.1 html5lib : None pymysql : None psycopg2 : None jinja2 : 2.10.1 IPython : 7.7.0 pandas_datareader: None bs4 : 4.8.0 bottleneck : None fastparquet : None gcsfs : None lxml.etree : 4.4.1 matplotlib : 3.1.1 numexpr : None odfpy : None openpyxl : 2.6.3 pandas_gbq : None pyarrow : None pytables : None s3fs : None scipy : 1.3.1 sqlalchemy : None tables : None xarray : None xlrd : 1.2.0 xlwt : None xlsxwriter : None
MaximeWeyl commented 5 years ago

One workaround I found is to pass Index to from_product instead of lists :


import pandas as pd

a = pd.Timestamp("2019-2-2").date()
i = pd.MultiIndex.from_product([
    pd.Index([a, a]),
    [2, 3]
])

print("a={} ({}".format(a, type(a)))
print(i[0])

Output :


a=2019-02-02 (<class 'datetime.date'>
(datetime.date(2019, 2, 2), 2)
TomAugspurger commented 5 years ago

I expect the output to respect the type I gave to from_product :

You don't specify a dtype, right? So this is a bug in inference.

I would expect the MultiIndex constructors to follow the behavior of Index (and Series), which preserve the datetime.

In [24]: pd.Index([a, a])
Out[24]: Index([2019-02-02, 2019-02-02], dtype='object')

In [25]: pd.Index([a, a])[0]
Out[25]: datetime.date(2019, 2, 2)
TomAugspurger commented 5 years ago

Hmm the bug seems to be in Categorical (used by MI internally)

In [31]: pd.Categorical([a]).categories
Out[31]: DatetimeIndex(['2019-02-02'], dtype='datetime64[ns]', freq=None)
mroeschke commented 5 years ago

Just noting the distinction that this is an issue with datetime.date objects, which are not first class in pandas.

kurtosis commented 4 years ago

I got tripped up on this as well when using groupby on a datetime.date column and moving it in and out of the index. The different behavior between Index and MultiIndex is especially tricky.

One suggestion: maybe add documentation/warning on the preferred way to round timestamps to date? This is a common operation and dt.date was all I came across in the docs. https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.date.html

(See also https://github.com/pandas-dev/pandas/issues/15906)

In [4]: df = pd.DataFrame({'date' : pd.date_range(start='2020-01-01', periods=10), 'label' : 1})                                                                              
In [5]: df['date'] = df['date'].dt.date                                                                                                                                       
In [6]: display(df.date[0])                                                                                                                                                   
datetime.date(2020, 1, 1)
In [7]: df_1 = df.set_index('date').reset_index()                                                                                                                             
In [8]: display(df_1.date[0])                                                                                                                                                 
datetime.date(2020, 1, 1)
In [9]: df_1 = df.set_index(['date', 'label']).reset_index()                                                                                                                  
In [10]: display(df_1.date[0])                                                                                                                                                
Timestamp('2020-01-01 00:00:00')