pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: concat with datetime64 columns of differing resolutions results in object dtype #53307

Open dalejung opened 1 year ago

dalejung commented 1 year ago

Reproducible Example

import pandas as pd

df = pd.DataFrame(
    {
        'ints': range(2),
        'dates': pd.date_range("2000", periods=2, freq="T"),
    },
)
df2 = df.copy()
df2['dates'] = df.dates.astype('M8[s]')
combined = pd.concat([df, df2])
assert pd.api.types.is_datetime64_any_dtype(combined.dates.dtype)  # fails: combined.dates has object dtype

Issue Description

Combining datetime64 columns with different resolutions returns an object dtype. This has been around since 2.0, but it only recently triggered for me because df['datecol'] = ts now uses the Timestamp's resolution to set the column dtype instead of defaulting to ns.

Expected Behavior

Not 100% sure. I would assume the combined dtype should be the most precise of the provided datetime64 resolutions.
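
Until the behavior changes, one possible workaround is to cast every frame to a single resolution before concatenating. The sketch below is only an illustration built on the reproducible example above: it assumes ns is the finest resolution present, and the names target and aligned are introduced here. The cast may raise if a value does not fit in the ns range.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ints": range(2),
        "dates": pd.date_range("2000", periods=2, freq="min"),
    }
)
df2 = df.copy()
df2["dates"] = df["dates"].astype("M8[s]")

# Assumption: ns is the most precise resolution among the inputs.
target = "datetime64[ns]"

# Cast the column to the shared resolution in each frame, then concatenate.
aligned = [frame.assign(dates=frame["dates"].astype(target)) for frame in (df, df2)]
combined = pd.concat(aligned)

print(combined["dates"].dtype)
```

With both columns at the same resolution, concat has no dtype mismatch to resolve, so the result stays datetime64.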

Installed Versions

INSTALLED VERSIONS
------------------
commit : 634b9402466c96c51e6667af935bde01e7ecf144
python : 3.11.3.final.0
python-bits : 64
OS : Linux
OS-release : 6.2.12-arch1-1
Version : #1 SMP PREEMPT_DYNAMIC Thu, 20 Apr 2023 16:11:55 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.0.dev0+802.g634b940246
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 65.3.0
pip : 22.2.2
Cython : 0.29.33
pytest : 7.1.3
hypothesis : 6.47.1
sphinx : 6.2.1
blosc : 1.11.1
feather : None
xlsxwriter : 3.1.0
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : 1.0.3
psycopg2 : 2.9.6
jinja2 : 3.1.2
IPython : 8.6.0.dev
pandas_datareader: None
bs4 : 4.12.2
bottleneck : 1.3.7
brotli :
fastparquet : 2023.4.0
fsspec : 2023.5.0
gcsfs : 2023.5.0
matplotlib : 3.7.1
numba : 0.57.0
numexpr : 2.8.4
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0.dev59+gb009c1242
pyreadstat : 1.2.1
pyxlsb : 1.0.10
s3fs : 2023.5.0
scipy : 1.10.1
snappy :
sqlalchemy : 2.0.14
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.5.0
xlrd : 2.0.1
zstandard : 0.21.0
tzdata : 2023.3
qtpy : None
pyqt5 : None

jbrockmendel commented 1 year ago

cc @jorisvandenbossche, who recently suggested changing find_common_type behavior here. I think I'd be OK with that.

topper-123 commented 1 year ago

+1 on using find_common_type or a similar function. One complication is that the function would need to figure out whether the timestamps in the less precise datetime array fit within the min/max time limits of the more precise datetime array.
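
The bounds check described above could look roughly like the following. This is a sketch written for illustration, not pandas' implementation: the helper fits_in_finer_unit and the table _NS_PER_UNIT are hypothetical names, and only the s, ms, us and ns units are handled.

```python
import numpy as np

# Nanoseconds per unit, for the resolutions pandas 2.x supports.
_NS_PER_UNIT = {"s": 10**9, "ms": 10**6, "us": 10**3, "ns": 1}

_INT64_MAX = np.iinfo(np.int64).max
_INT64_MIN = np.iinfo(np.int64).min + 1  # the lowest int64 value is reserved for NaT


def fits_in_finer_unit(values: np.ndarray, target_unit: str) -> bool:
    """Return True if every non-NaT value in a datetime64 array is
    representable at the finer target_unit (hypothetical helper)."""
    src_unit, _ = np.datetime_data(values.dtype)
    scale = _NS_PER_UNIT[src_unit] // _NS_PER_UNIT[target_unit]
    if scale < 1:
        raise ValueError("target_unit must be at least as fine as the source unit")

    valid = values[~np.isnat(values)]
    if valid.size == 0:
        return True

    # Compare with Python ints so the multiplication cannot overflow.
    ints = valid.view("i8")
    lo, hi = int(ints.min()), int(ints.max())
    return _INT64_MIN <= lo * scale and hi * scale <= _INT64_MAX


in_range = np.array(["2000-01-01", "NaT"], dtype="M8[s]")
out_of_range = np.array(["3000-01-01"], dtype="M8[s]")

print(fits_in_finer_unit(in_range, "ns"))
print(fits_in_finer_unit(out_of_range, "ns"))
print(fits_in_finer_unit(out_of_range, "us"))
```

Only the minimum and maximum need checking, so the cost is a single pass over the coarser array.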

meistermeister commented 1 year ago

This also occurs with mismatched timezones at the same resolution. Is that the same bug?

jbrockmendel commented 1 year ago

That is intentional, but there has been discussion of changing/deprecating it so that the result is converted to UTC.
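
For the mismatched-timezone case, converting each input to UTC explicitly before concatenating keeps a datetime dtype. The snippet below is a sketch of that workaround, not the behavior under discussion; the two timezones and the variable names are arbitrary choices made for illustration.

```python
import pandas as pd

# Two tz-aware series at the same resolution but in different timezones.
eastern = pd.Series(pd.date_range("2000-01-01", periods=2, freq="D", tz="America/New_York"))
london = pd.Series(pd.date_range("2000-01-01", periods=2, freq="D", tz="Europe/London"))

# Convert both to UTC so the inputs share one tz-aware dtype.
combined = pd.concat(
    [eastern.dt.tz_convert("UTC"), london.dt.tz_convert("UTC")],
    ignore_index=True,
)

print(combined.dtype)
```

tz_convert changes only the timezone attached to each timestamp, so the instants themselves are preserved.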

meistermeister commented 1 year ago

I think that behavior is new, perhaps it should raise a warning?

jbrockmendel commented 1 year ago

Discussed a couple of dev calls ago; we agreed to have this cast to the higher resolution. I don't recall if we said the change needed a deprecation cycle.

JMBurley commented 1 year ago

Just stumbled across this issue. Totally agree on casting to the higher resolution, and I would strongly advocate treating this as a BUG and not running a deprecation cycle.

Principle of least surprise is that concatenating dates results in dates. The fact that one source is (say) ms and the other ns resolution does not make conversion to object dtype any less surprising.

Note that this bug also occurs when concatenating data from a single timezone, at a constant resolution, if that timezone has multiple offsets from UTC (i.e. a daylight saving transition occurred during the series' timespan).

I think that brings the sum of addressable issues to:

BUG

INTENTIONAL (but may change)

badge commented 7 months ago

TL;DR: I think this is fixed in 2.2.0.

I came across this issue while searching for the issue I was having (https://github.com/pandas-dev/pandas/issues/53640). It appears the fix for that issue in 2.2.0 (https://github.com/pandas-dev/pandas/pull/53641) has also resolved this issue as I can't recreate it in 2.2.0, while I can in 2.1.4.