pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: pandas.read_json casts float column to int #58866

Open albertvillanova opened 5 months ago

albertvillanova commented 5 months ago

Pandas version checks

Reproducible Example

import io
import json
import pandas as pd

d = [{"col1": 1, "col2": 1.0}, {"col1": 2, "col2": 2.0}]
df = pd.read_json(io.StringIO(json.dumps(d)))
assert df["col1"].dtype == df["col2"].dtype

Issue Description

Although the JSON values are explicitly floats, the corresponding column is given an int dtype.

Expected Behavior

If the JSON contains float values, we would expect the corresponding column dtype to be float as well.

At least, we should be able to avoid this casting if needed.
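One way to avoid the casting today, assuming the column names are known up front, is to pass an explicit dtype mapping (a minimal sketch against the example above; this sidesteps the inference rather than fixing it):

```python
import io
import json
import pandas as pd

d = [{"col1": 1, "col2": 1.0}, {"col1": 2, "col2": 2.0}]

# An explicit per-column dtype mapping overrides the inference,
# so col2 keeps its float dtype.
df = pd.read_json(io.StringIO(json.dumps(d)), dtype={"col2": "float64"})
assert df["col1"].dtype != df["col2"].dtype
assert str(df["col2"].dtype) == "float64"
```

This only helps when the schema is known in advance, which is often not the case for generic JSON ingestion.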

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-107-generic
Version : #117~20.04.1-Ubuntu SMP Tue Apr 30 10:35:57 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.22.4
pytz : 2021.3
dateutil : 2.8.2
setuptools : 57.0.0
pip : 24.0
Cython : 0.29.24
pytest : None
hypothesis : None
sphinx : 3.1.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.4
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.30.1
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.5.0
gcsfs : None
matplotlib : None
numba : 0.58.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 14.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.5.0
scipy : 1.11.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
albertvillanova commented 5 months ago

Note that the behavior described above is different if the format is JSON-Lines and the "pyarrow" engine is used:

json_lines = b'{"col1": 1, "col2": 1.0}\n{"col1": 2, "col2": 2.0}'

df = pd.read_json(io.BytesIO(json_lines), lines=True, engine="pyarrow")
assert not (df["col1"].dtype == df["col2"].dtype)

On the other hand, the downcasting reappears if the "ujson" engine (the default) is used:

json_lines = b'{"col1": 1, "col2": 1.0}\n{"col1": 2, "col2": 2.0}'

df = pd.read_json(io.BytesIO(json_lines), lines=True)
assert df["col1"].dtype == df["col2"].dtype
PushpitSB commented 5 months ago

That's a good point.

albertvillanova commented 5 months ago

Also note that this downcasting is not performed by pandas.read_csv:

csv_content = "col1,col2\n1,1.0\n2,2.0"

df = pd.read_csv(io.StringIO(csv_content))
assert not (df["col1"].dtype == df["col2"].dtype)
albertvillanova commented 5 months ago

Additionally, a str column is also cast to int:

d = [{"col1": 1, "col2": 1.0, "col3": "1"}, {"col1": 2, "col2": 2.0, "col3": "2"}]

df = pd.read_json(io.StringIO(json.dumps(d)))
assert df["col1"].dtype == df["col2"].dtype == df["col3"].dtype
rhshadrach commented 5 months ago

Passing dtype=False, I get the expected behavior of the OP. But the docstring doesn't seem clear to me:

If True, infer dtypes; if a dict of column to dtype, then use those; if False, then don’t infer dtypes at all, applies only to the data.

Perhaps the language can be improved.

@albertvillanova - can you confirm if dtype=False satisfies your use-case? Labeling this as just a docs issue for now.
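For reference, a minimal sketch of the dtype=False suggestion with the default (numpy) backend, using the example from the issue description:

```python
import io
import json
import pandas as pd

d = [{"col1": 1, "col2": 1.0}, {"col1": 2, "col2": 2.0}]

# dtype=False turns off dtype inference, so the float column is
# no longer downcast to int64 (default numpy-backed dtypes).
df = pd.read_json(io.StringIO(json.dumps(d)), dtype=False)
assert str(df["col1"].dtype) == "int64"
assert str(df["col2"].dtype) == "float64"
```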

albertvillanova commented 5 months ago

@rhshadrach thanks for your reply.

Unfortunately, passing dtype=False does not satisfy my use-case, because I was in fact also passing dtype_backend="pyarrow" (I omitted it from the description to keep things simple).

The float-to-int downcasting therefore persists when passing dtype_backend="pyarrow", even with dtype=False:

d = [{"col1": 1, "col2": 1.0}, {"col1": 2, "col2": 2.0}]

df = pd.read_json(io.StringIO(json.dumps(d)), dtype_backend="pyarrow")
assert df["col1"].dtype == df["col2"].dtype

df = pd.read_json(io.StringIO(json.dumps(d)), dtype_backend="pyarrow", dtype=False)
assert df["col1"].dtype == df["col2"].dtype

Additionally, I would like to ask whether, in the former case (when not passing dtype_backend="pyarrow"), there would be other side effects of passing dtype=False. Would other dtypes be treated differently?

rhshadrach commented 5 months ago

Thanks @albertvillanova - I've reclassified this issue.

chialin6 commented 3 months ago

Interesting that BytesIO works while StringIO doesn't. I'm a first-time contributor, not sure if I'm able to solve the bug, but would like to take a look at this issue.

chialin6 commented 3 months ago

take

chialin6 commented 3 months ago

@albertvillanova @rhshadrach I did more experiments for this issue. It seems that with dtype=False and the default dtype_backend, the dtypes come out correctly. Would this help with your use-case?

data = [{"col1": 1, "col2": 1.0, "col3": "1"}, {"col1": 2, "col2": 2.0, "col3": "2"}]
df = pd.read_json(io.StringIO(json.dumps(data)), dtype=False)
assert df["col1"].dtype != df["col2"].dtype
assert df["col2"].dtype != df["col3"].dtype
assert df["col1"].dtype != df["col3"].dtype

# df["col1"].dtype == dtype('int64')
# df["col2"].dtype == dtype('float64')
# df["col3"].dtype == dtype('O')
albertvillanova commented 3 months ago

As commented above, I need to pass dtype_backend="pyarrow".