BUG: pyxlsb fails to parse datetime-like data

MCRE-BE commented 3 weeks ago

Pandas version checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

When trying to build some new tests, I found a strange behavior.

My apologies, but I can't simply have a copy/paste code answer as we can't export to xlsband xlsmdirectly in pandas if I'm not mistaken.
Data can be found with test environment in pandas\tests\io\data\excel
A local copy : 07. Pandas tests.zip

import pandas as pd
from pathlib import Path

files = Path(".").glob("test1*")
files = list(files)
files
>>> [WindowsPath('test1.ods'),
>>>  WindowsPath('test1.xls'),
>>> WindowsPath('test1.xlsb'),
>>> WindowsPath('test1.xlsm'),
>>> WindowsPath('test1.xlsx')]

for f in files:
    print(f)
    display(pd.read_excel(f))

Issue Description

Expected Behavior

All dataframes should be read the same

Installed Versions

INSTALLED VERSIONS ------------------ commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140 python : 3.12.6.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19042 machine : AMD64 processor : AMD64 Family 23 Model 24 Stepping 1, AuthenticAMD byteorder : little LC_ALL : None LANG : None LOCALE : Dutch_Belgium.1252 pandas : 2.2.2 numpy : 2.0.2 pytz : 2024.2 dateutil : 2.9.0 setuptools : 75.1.0 pip : 24.2 Cython : None pytest : 8.3.3 hypothesis : None sphinx : 7.4.7 blosc : None feather : None xlsxwriter : 3.2.0 lxml.etree : 5.3.0 html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : 3.1.4 IPython : 8.21.0 pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : 4.12.3 bottleneck : 1.4.0 dataframe-api-compat : None fastparquet : None fsspec : 2024.9.0 gcsfs : None matplotlib : 3.9.1 numba : 0.60.0 numexpr : 2.10.0 odfpy : None openpyxl : 3.1.5 pandas_gbq : None pyarrow : 17.0.0 pyreadstat : None python-calamine : None pyxlsb : 1.0.10 s3fs : None scipy : 1.14.1 sqlalchemy : None tables : None tabulate : 0.9.0 xarray : None xlrd : 2.0.1 zstandard : 0.23.0 tzdata : 2024.2 qtpy : 2.4.1 pyqt5 : None

asishm commented 3 weeks ago

This is a limitation of the pyxlsb library. Their docs note that

Do note that dates will appear as floats. You must use the convert_date(date) method from the pyxlsb module to turn them into datetime instances.

Also, you'll notice there's a marker that xfails datetime related tests with xlsb files - https://github.com/pandas-dev/pandas/blob/2a10e04a099d5f1633abcdfbb2dd9fdf09142f8d/pandas/tests/io/excel/test_readers.py#L153

MCRE-BE commented 3 weeks ago

@asishm : Thanks for the information. So it's a "feature" / "chosen limitation" (I don't know how to express this better) instead of an oversight that Pandas does convert float to dates for pyxlsb ? If it's the case, we can indeed close this issue as it's not a bug.

asishm commented 3 weeks ago

Pandas relies on other libraries to do most of the heavy lifting in parsing excel files. In case of xlsb files, pandas relies on pyxlsb. It looks like results returned by the library don't differentiate between dates and floats (unlike other libraries like openpyxl for xlsx files). Therefore, pandas has no way to know which columns are date columns.

If you know which columns are dates, you can pass in pyxlsb's convert_date function in the converters parameter of pd.read_excel (or do it after the pd.read_excel call)

In [6]: df = pd.read_excel("./pandas/tests/io/data/excel/test1.xlsb", converters={0: pyxlsb.convert
   ...: _date})

In [7]: df
Out[7]:
  Unnamed: 0         A         B         C         D
0 2000-01-03  0.980269  3.685731 -0.364217 -1.159738
1 2000-01-04  1.047916 -0.041232 -0.161812  0.212549
2 2000-01-05  0.498581  0.731168 -0.537677  1.346270
3 2000-01-06  1.120202  1.567621  0.003641  0.675253
4 2000-01-07 -0.487094  0.571455 -1.611639  0.103469
5 2000-01-10  0.836649  0.246462  0.588543  1.062782
6 2000-01-11 -0.157161  1.340307  1.195778 -1.097007

I guess a note indicating this limitation might be good.

pandas-dev / pandas