pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Consistent dtype output for element wise operations on empty dataframes #32802

Open benmatwil opened 4 years ago

benmatwil commented 4 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd
s = pd.Series([], dtype='datetime64[ns, UTC]')
df = pd.concat([s, s], axis=1) # empty dataframe of two columns of datetime dtype
# want the minimum of the two datetime columns element wise
min_dates = df.min(axis=1) # output series is of dtype float64

# it's fine if the dataframe is non-empty
s = pd.Series(pd.to_datetime([f'2020-01-0{i}' for i in range(1, 10)], utc=True))
df = pd.concat([s, s], axis=1) # non-empty dataframe of two columns of datetime dtype
min_dates = df.min(axis=1) # output series is of dtype datetime64[ns, UTC]

Problem description

When the dataframe is empty, .min(axis=1) returns a series of dtype float64 even though all columns have dtype datetime64[ns, UTC]. Shouldn't the output dtype be consistent, so that min_dates has dtype datetime64[ns, UTC]?

This is only an issue when the dataframe is empty. If the dataframe is not empty, the dtype is preserved and min_dates has dtype datetime64[ns, UTC].
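As a possible stopgap, the empty case can be handled explicitly in user code. The sketch below is only an illustration under stated assumptions: `rowwise_min_keep_dtype` is a hypothetical helper, not part of pandas, and it only covers frames whose columns all share one dtype. It builds the empty result directly from the column dtype and otherwise defers to `df.min(axis=1)`.

```python
import pandas as pd


def rowwise_min_keep_dtype(df: pd.DataFrame) -> pd.Series:
    """Row-wise minimum that keeps the shared column dtype for a frame with no rows.

    Hypothetical helper for illustration; not part of pandas.
    """
    column_dtypes = set(df.dtypes)
    if len(df) == 0 and len(column_dtypes) == 1:
        # No rows to reduce: build the empty result explicitly so its dtype
        # comes from the columns rather than from the reduction.
        return pd.Series([], index=df.index, dtype=column_dtypes.pop())
    # Non-empty (or mixed-dtype) frames use the normal reduction.
    return df.min(axis=1)


# Empty frame, as in the report above.
empty = pd.Series([], dtype="datetime64[ns, UTC]")
empty_result = rowwise_min_keep_dtype(pd.concat([empty, empty], axis=1))
print(empty_result.dtype)  # datetime64[ns, UTC]

# Non-empty frame: the helper simply defers to df.min(axis=1).
dates = pd.Series(pd.to_datetime(["2020-01-01", "2020-01-02"], utc=True))
later = dates + pd.Timedelta(days=1)
full_result = rowwise_min_keep_dtype(pd.concat([later, dates], axis=1))
print(full_result.dtype)
```

This does not change what `df.min(axis=1)` itself returns; it only sidesteps the empty case on the caller's side.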

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.6.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-862.3.3.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.2
numpy : 1.16.5
pytz : 2019.3
dateutil : 2.8.0
pip : 20.0.2
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.0.1
hypothesis : None
sphinx : 2.1.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.7.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : None
fastparquet : None
gcsfs : 0.3.1
lxml.etree : None
matplotlib : 3.1.1
numexpr : 2.6.9
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.0.1
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.8
tables : 3.5.2
tabulate : None
xarray : 0.14.0
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : 0.46.0

arw2019 commented 4 years ago

A slightly tweaked version of the original report:

import pandas as pd 
s1 = pd.Series([], dtype='datetime64[ns, UTC]') 
df1 = pd.concat([s1, s1], axis=1) # empty dataframe of two columns of datetime dtype 
# want the minimum of the two datetime columns element wise 
min_dates_1 = df1.min(axis=1) # output series is of dtype float64 
print(min_dates_1.dtype) 

s2 = pd.Series(pd.to_datetime([f'2020-01-0{i}' for i in range(1, 10)], utc=True)) 
df2 = pd.concat([s2, s2], axis=1) # non-empty dataframe of two columns of datetime dtype 
min_dates_2 = df2.min(axis=1) # output series is of dtype datetime64[ns, UTC] 
print(min_dates_2.dtype) 

Also, I still get this on the master branch of pandas.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : 9e1b95f721de87d83a2644f49a39528b4b81f536
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-106-generic
Version : #107-Ubuntu SMP Thu Jun 4 11:27:52 UTC 2020
machine : x86_64
processor :
byteorder : little
LC_ALL : C.UTF-8
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0.dev0+2003.g9e1b95f72.dirty
numpy : 1.17.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 46.4.0.post20200518
Cython : 0.29.19
pytest : 5.4.2
hypothesis : 5.15.1
sphinx : 3.0.4
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.14.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fastparquet : 0.4.0
gcsfs : None
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.49.1