pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Int64 numbers from Pandas DataFrame.to_markdown() incorrectly displayed #49465

Open jbencina opened 1 year ago

jbencina commented 1 year ago

Summary

When a pandas DataFrame contains a 64-bit integer and the .to_markdown() method is called on the DataFrame, the printed integer is incorrect: the value is formatted as a float, which cannot represent it exactly, so the low-order digits are lost.

This behavior is passed along by the tabulate package, but it ultimately comes from how Python formats floats. I bring this up here because the pandas .head() method does print the correct number. Should pandas handle this case so that users get a consistent view of DataFrame data regardless of the method used?

If this fix is outside the scope of pandas, perhaps the pandas documentation should be updated with a warning.

Reproduction

Test 64-bit int with pandas head()

import pandas as pd
df = pd.DataFrame({'colA': [503498111827123021]})
df.head()
                 colA
0  503498111827123021

Test 64-bit int with pandas to_markdown()

import pandas as pd
df = pd.DataFrame({'colA': [503498111827123021]})
print(df.to_markdown(floatfmt='.0f'))
|    |               colA |
|---:|-------------------:|
|  0 | 503498111827123008 |

Test with Python format()

>>> format(503498111827123021, '.0f')
'503498111827123008'
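
The rounding above is consistent with double-precision floats, which CPython uses for `float`: the significand has only 53 bits, so not every integer above 2**53 can be represented exactly. A minimal standard-library sketch of this (the variable names are illustrative and not part of the original report):

```python
n = 503498111827123021  # value from the report

# The '.0f' format spec converts the int to a float first,
# and a float has only a 53-bit significand.
as_float = float(n)
print(n > 2**53)           # True: beyond the range where every int is exact
print(int(as_float))       # 503498111827123008, matching the output above
print(n - int(as_float))   # 13

# An integer format spec keeps the value as an int, so no digits are lost.
print(format(n, 'd'))      # 503498111827123021
```

So the value is not overflowing; it is being rounded to the nearest representable float before it is printed.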

Pandas Version

Python 3.9.6 (default, Aug  5 2022, 15:21:02)
[Clang 14.0.0 (clang-1400.0.29.102)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 91111fd99898d9dcaa6bf6bedb662db4108da6e6
python           : 3.9.6.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.6.0
Version          : Darwin Kernel Version 21.6.0: Thu Sep 29 20:12:57 PDT 2022; root:xnu-8020.240.7~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.5.1
numpy            : 1.23.4
pytz             : 2022.6
dateutil         : 2.8.2
setuptools       : 58.0.4
pip              : 21.2.4
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : 0.9.0
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None

MarcoGorelli commented 1 year ago

Thanks @jbencina for the report

Looks like the issue isn't in pandas?

In [7]: tabulate.tabulate(df, floatfmt='.0f')
Out[7]: '-  ------------------\n0  503498111827123008\n-  ------------------'

Might be something to report to https://github.com/astanin/python-tabulate

jbencina commented 1 year ago

@MarcoGorelli Thanks. I opened a ticket with the tabulate team: https://github.com/astanin/python-tabulate/issues/213. The root cause seems to be that tabulate treats the int64 data type as a float when the value comes from a DataFrame, so it applies the float format to an integer. Passing a large Python int directly to tabulate doesn't produce this issue:

from tabulate import tabulate

table = [[503498111827123021]]

print(tabulate(table))
------------------
503498111827123021
------------------

print(tabulate(table, floatfmt='.0f'))
------------------
503498111827123021
------------------
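
To illustrate the difference described above, here is a minimal standard-library sketch of type-aware cell formatting. `format_cell` is a hypothetical helper written for this example; it is not tabulate's actual code and not the fix that was applied there. It only shows why choosing the format spec by type preserves the digits, while applying a float spec to an integer does not.

```python
import numbers


def format_cell(value, floatfmt=".0f"):
    """Hypothetical helper: pick a format spec based on the value's type."""
    if isinstance(value, bool):
        # bool is a subclass of int, so handle it before the integer branch.
        return str(value)
    if isinstance(value, numbers.Integral):
        # Integers are formatted as integers, so every digit is preserved.
        return format(value, "d")
    if isinstance(value, float):
        return format(value, floatfmt)
    return str(value)


big = 503498111827123021
print(format_cell(big))         # 503498111827123021
print(format(big, ".0f"))       # 503498111827123008 (int formatted as float)
print(format_cell(2.5, ".1f"))  # 2.5
```

The check against `numbers.Integral` rather than `int` is a design assumption: values coming out of a DataFrame are typically NumPy scalars rather than Python ints, and a plain `isinstance(value, int)` test would miss them.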

jbencina commented 1 year ago

Confirmed: this is fixed in the upcoming release of tabulate.

MarcoGorelli commented 1 year ago

cool, thanks!

MarcoGorelli commented 1 year ago

the minimum version should probably be bumped then - do you want to submit a pull request to do that?

(reopening the issue until the minimum version is bumped)

jbencina commented 1 year ago

Good point. I'll find out whether there's an estimate for when the next version will be released, and circle back here with a PR once it's available.