pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: df.to_json causes low precision in floats #59313

Open bytelinker opened 4 months ago

bytelinker commented 4 months ago

Pandas version checks

Reproducible Example

import numpy as np
import pandas as pd
import sys

# pandas display options (affect printing only; they do not change to_json output)
pd.options.display.precision = 16
pd.options.styler.format.precision = 16
# pd.describe_option()

limit = 17
# 35 exponents from -17 to +17 -> test values spanning ~1.2e-17 to ~1.2e+17
exponents = np.linspace(-limit, limit, num=2 * limit + 1, endpoint=True, dtype='float64')
test_values = 1.2345678901234567890 * 10 ** exponents

df = pd.DataFrame()
df['x'] = test_values
# Convert the floating-point numbers to strings with high precision
df['x_string'] = df['x'].apply(lambda x: format(x, '.18e'))
df['exponent'] = exponents

df.to_json('./bug_df_to_json.json', indent=4, double_precision=15)

print(f"Pandas version: {pd.__version__}")
print(f"Numpy version: {np.__version__}")
print(f"Python version: {sys.version}")
# print(df)

# Excerpt of the JSON written above (values of column 'x', keyed by row index):
# "1": 1.23456789012346e-16,                # Precision 15
# "2": 0.000000000000001,                   # Precision 1
# "3": 0.000000000000012,                   # Precision 2
# "4": 0.000000000000123,                   # Precision 3
# "5": 0.000000000001235,                   # Precision 4
# "6": 0.000000000012346,                   # Precision 5
# "7": 0.000000000123457,
# "8": 0.000000001234568,
# "9": 0.000000012345679,
# "10": 0.000000123456789,
# "11": 0.00000123456789,
# "12": 0.000012345678901,
# "13": 0.000123456789012,
# "14": 0.001234567890123,
# "16": 0.123456789012346,
# "17": 1.234567890123457,
# "15": 0.012345678901235,
# "18": 12.345678901234567,
# "19": 123.456789012345666,
# "20": 1234.567890123456664,
# "21": 12345.678901234567093,
# "22": 123456.789012345674564,
# "23": 1234567.890123456716538,
# "24": 12345678.901234567165375,
# "25": 123456789.012345671653748,
# "26": 1234567890.123456716537476,
# "27": 12345678901.234567642211914,
# "28": 123456789012.345672607421875,     # Precision 27
# "29": 1234567890123.456787109375,
# "30": 12345678901234.56640625,
# "31": 123456789012345.671875,
# "32": 1234567890123456.75,              Precision 18

Issue Description

The floats written to JSON have a fixed number of digits after the decimal point, so the trailing significant digits are truncated and precision is lost. If the floats were written in scientific notation instead, no precision would be lost.

Possibly relevant code: https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/src/vendored/ujson/lib/ultrajsonenc.c https://github.com/pandas-dev/pandas/blob/63dc1bb4f99d24b46bacb113d740d54459fdbe5e/pandas/_libs/src/vendored/ujson/lib/ultrajsonenc.c#L818
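
For illustration only, plain Python formatting shows the effect being described here. This mimics fixed-decimal-place output versus significant-digit output; it is not the actual ujson code path:

value = 1.2345678901234567e-15
print(format(value, '.15f'))  # '0.000000000000001' -> only one significant digit survives
print(format(value, '.17g'))  # '1.2345678901234567e-15' -> 17 significant digits kept
print(repr(value))            # shortest representation that round-trips exactly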

Expected Behavior

Write floats in scientific notation in the JSON output so that no precision is lost.
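
Until something like that exists, one possible workaround is a sketch, not a pandas feature: reuse the df from the reproducible example and serialize native Python lists through the standard-library json module, whose encoder writes floats via repr() and therefore round-trips them (the output file name here is made up):

import json

# Workaround sketch: tolist() yields native Python scalars, and the stdlib
# json encoder formats floats with repr(), which preserves full precision.
payload = {col: df[col].tolist() for col in df.columns}
with open('./workaround_df_to_json.json', 'w') as fh:
    json.dump(payload, fh, indent=4)

Reading the file back with json.load and passing the dict to pd.DataFrame reproduces the original float64 values exactly, though the original index is not preserved.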

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.11.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 170 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : de_DE.cp1252
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 68.0.0
pip : 24.0
Cython : 3.0.9
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.2.0
lxml.etree : 5.1.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.12.3
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2023.6.0
gcsfs : None
matplotlib : 3.8.3
numba : 0.59.1
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : 1.0.10
s3fs : None
scipy : 1.12.0
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2024.2.0
xlrd : None
zstandard : 0.22.0
tzdata : 2024.1
qtpy : 2.4.1
pyqt5 : None
bigrogerio commented 3 months ago

Can I investigate this issue?

rhshadrach commented 3 months ago

Thanks for the report. I think the first line using exponential notation is a duplicate of #23328. It seems to me the intended behavior of to_json is to remove any information after what is specified by double_precision. This would mean a request to not do so is a feature enhancement.
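
A small illustration of that reading of double_precision, as a sketch (the exact output string may differ slightly across versions):

import pandas as pd

s = pd.Series([0.123456789012345678])
# double_precision is documented as the number of decimal places, so the
# digits beyond the third decimal place are expected to be dropped here.
print(s.to_json(double_precision=3))  # roughly '{"0":0.123}'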

bytelinker commented 3 months ago

Thank you for the answer. When writing test_values = 1.2345678901234567890 * 10 ** exponents to JSON and getting back, for example, 0.000000000000001, the error is approximately 23.4 %: the digits after the leading 1 (234567...) are missing. Can this be changed with other options in the current version? But since it can happen without any warning, I would consider this a bug.
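
For context, the arithmetic behind that figure, as a quick sketch using the value with exponent -15 from the example:

original = 1.2345678901234567890e-15  # value produced by the script
stored = 1e-15                        # what survives in the JSON with double_precision=15
print(abs(original - stored) / stored)  # ~0.2346, i.e. the ~23.4 % mentioned above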

bytelinker commented 3 months ago

Hello bigrogerio,

Of course you can investigate this issue.

rhshadrach commented 3 months ago

Can this be changed by using other options in the actual version, please?

Not that I am aware of.

But as it can be done without a warning I would consider this as a bug.

From what I can tell, this seems to be the intended design. If there is any indication to the contrary, then I would agree it is a bug. However if it is indeed the intended design, then what you are asking for is an API/behavior change rather than a bugfix.

bytelinker commented 3 months ago

Using floats in pandas is a common use case. When saving these floats to JSON and loading them again, a lot of precision is lost, much more than one would expect. I can't see why this would be the intended behavior.