pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.72k stars 17.93k forks source link

BUG: pyxlsb fails to parse datetime-like data #60062

Closed MCRE-BE closed 3 weeks ago

MCRE-BE commented 3 weeks ago

Pandas version checks

Reproducible Example

When trying to build some new tests, I found a strange behavior.

import pandas as pd
from pathlib import Path

files = Path(".").glob("test1*")
files = list(files)
files
>>> [WindowsPath('test1.ods'),
>>>  WindowsPath('test1.xls'),
>>> WindowsPath('test1.xlsb'),
>>> WindowsPath('test1.xlsm'),
>>> WindowsPath('test1.xlsx')]

for f in files:
    print(f)
    display(pd.read_excel(f))

Issue Description

image

Expected Behavior

All dataframes should be read the same

Installed Versions

INSTALLED VERSIONS ------------------ commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140 python : 3.12.6.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19042 machine : AMD64 processor : AMD64 Family 23 Model 24 Stepping 1, AuthenticAMD byteorder : little LC_ALL : None LANG : None LOCALE : Dutch_Belgium.1252 pandas : 2.2.2 numpy : 2.0.2 pytz : 2024.2 dateutil : 2.9.0 setuptools : 75.1.0 pip : 24.2 Cython : None pytest : 8.3.3 hypothesis : None sphinx : 7.4.7 blosc : None feather : None xlsxwriter : 3.2.0 lxml.etree : 5.3.0 html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : 3.1.4 IPython : 8.21.0 pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : 4.12.3 bottleneck : 1.4.0 dataframe-api-compat : None fastparquet : None fsspec : 2024.9.0 gcsfs : None matplotlib : 3.9.1 numba : 0.60.0 numexpr : 2.10.0 odfpy : None openpyxl : 3.1.5 pandas_gbq : None pyarrow : 17.0.0 pyreadstat : None python-calamine : None pyxlsb : 1.0.10 s3fs : None scipy : 1.14.1 sqlalchemy : None tables : None tabulate : 0.9.0 xarray : None xlrd : 2.0.1 zstandard : 0.23.0 tzdata : 2024.2 qtpy : 2.4.1 pyqt5 : None
asishm commented 3 weeks ago

This is a limitation of the pyxlsb library. Their docs note that

Do note that dates will appear as floats. You must use the convert_date(date) method from the pyxlsb module to turn them into datetime instances.

Also, you'll notice there's a marker that xfails datetime related tests with xlsb files - https://github.com/pandas-dev/pandas/blob/2a10e04a099d5f1633abcdfbb2dd9fdf09142f8d/pandas/tests/io/excel/test_readers.py#L153

MCRE-BE commented 3 weeks ago

@asishm : Thanks for the information. So it's a "feature" / "chosen limitation" (I don't know how to express this better) instead of an oversight that Pandas does convert float to dates for pyxlsb ? If it's the case, we can indeed close this issue as it's not a bug.

asishm commented 3 weeks ago

Pandas relies on other libraries to do most of the heavy lifting in parsing excel files. In case of xlsb files, pandas relies on pyxlsb. It looks like results returned by the library don't differentiate between dates and floats (unlike other libraries like openpyxl for xlsx files). Therefore, pandas has no way to know which columns are date columns.

If you know which columns are dates, you can pass in pyxlsb's convert_date function in the converters parameter of pd.read_excel (or do it after the pd.read_excel call)

In [6]: df = pd.read_excel("./pandas/tests/io/data/excel/test1.xlsb", converters={0: pyxlsb.convert
   ...: _date})

In [7]: df
Out[7]:
  Unnamed: 0         A         B         C         D
0 2000-01-03  0.980269  3.685731 -0.364217 -1.159738
1 2000-01-04  1.047916 -0.041232 -0.161812  0.212549
2 2000-01-05  0.498581  0.731168 -0.537677  1.346270
3 2000-01-06  1.120202  1.567621  0.003641  0.675253
4 2000-01-07 -0.487094  0.571455 -1.611639  0.103469
5 2000-01-10  0.836649  0.246462  0.588543  1.062782
6 2000-01-11 -0.157161  1.340307  1.195778 -1.097007

I guess a note indicating this limitation might be good.