pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
42.65k stars 17.58k forks source link

BUG: Difference between calamine and openpyxl readers - columns with mixed data types #59186

Open asishm opened 4 days ago

asishm commented 4 days ago

Pandas version checks

Reproducible Example

import pandas as pd
import numpy as np
from io import BytesIO
from datetime import datetime

df = pd.DataFrame({'a': ['abc', np.nan, datetime.now(), 'def']})

out = BytesIO()
df.to_excel(out, index=False)
df_openpyxl = pd.read_excel(out, engine='openpyxl')
df_calamine = pd.read_excel(out, engine='calamine')

In [49]: df_openpyxl['a'].map(type)
Out[49]:
0                  <class 'str'>
1                <class 'float'>
2    <class 'datetime.datetime'>
3                  <class 'str'>
Name: a, dtype: object

In [50]: df_calamine['a'].map(type)
Out[50]:
0                                        <class 'str'>
1                                      <class 'float'>
2    <class 'pandas._libs.tslibs.timestamps.Timesta...
3                                        <class 'str'>
Name: a, dtype: object

Issue Description

When trying to read an excel file that has mixed formats, using the calamine engine - results in an object dtype column where the value is pandas._libs.tslibs.timestamps.Timestamp and pandas._libs.tslibs.timedeltas.Timedelta whereas with the openpyxl engine, these values are datetime.datetime and datetime.timedelta instead.

The difference seems to be because the calamine implementation explicitly sets pd.Timestamp/pd.Timedelta data types https://github.com/pandas-dev/pandas/blob/79067a76adc448d17210f2cf4a858b0eb853be4c/pandas/io/excel/_calamine.py#L109-L115

whereas the openpyxl implementation uses whatever the library returns.

Expected Behavior

The individual item data types should match

Installed Versions

INSTALLED VERSIONS ------------------ commit : dd87dd3ef661ac06e30adc55d25a1f03deda3abf python : 3.10.14 python-bits : 64 OS : Linux OS-release : 4.19.128-microsoft-standard Version : #1 SMP Tue Jun 23 12:58:10 UTC 2020 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : C.UTF-8 LOCALE : en_US.UTF-8 pandas : 3.0.0.dev0+1113.gdd87dd3ef6 numpy : 1.26.4 pytz : 2024.1 dateutil : 2.9.0 pip : 24.0 Cython : 3.0.10 sphinx : 7.3.7 IPython : 8.25.0 adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : 4.12.3 blosc : None bottleneck : 1.3.8 fastparquet : 2024.5.0 fsspec : 2024.6.0 html5lib : 1.1 hypothesis : 6.103.0 gcsfs : 2024.6.0 jinja2 : 3.1.4 lxml.etree : 5.2.2 matplotlib : 3.8.4 numba : 0.59.1 numexpr : 2.10.0 odfpy : None openpyxl : 3.1.2 psycopg2 : 2.9.9 pymysql : 1.4.6 pyarrow : 16.1.0 pyreadstat : 1.2.7 pytest : 8.2.1 python-calamine : None pyxlsb : 1.0.10 s3fs : 2024.6.0 scipy : 1.13.1 sqlalchemy : 2.0.30 tables : 3.9.2 tabulate : 0.9.0 xarray : 2024.5.0 xlrd : 2.0.1 xlsxwriter : 3.1.9 zstandard : 0.19.0 tzdata : 2024.1 qtpy : None pyqt5 : None
asishm commented 4 days ago

Looking at it a bit more, isinstance(pd.Timestamp(...), datetime) returns True and same with isinstance(pd.Timedelta(...), timedelta), so maybe it's moot.

rhshadrach commented 1 day ago

I don't see a very strong reason to prefer one over the other. When doing inference for constructors, we store these as datetime.date objects. E.g.

import datetime

print(type(pd.Series(datetime.date(2024, 3, 10)).iloc[0]))
# <class 'datetime.date'>

In addition, openpyxl has been around much longer than calamine. Both of these suggest to me we should return the Python objects.