pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: read_json silently ignores the dtype when engine=pyarrow #59516

Open benjamin-hodgson opened 2 months ago

benjamin-hodgson commented 2 months ago

Reproducible Example

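import json

import pandas as pd

# Write a single JSON record; lines=True reads the file as newline-delimited JSON.
with open("test.json", "w") as f:
    json.dump({"Col1": 123}, f)

# Request a 32-bit, pyarrow-backed integer column.
dtype = {"Col1": "int32[pyarrow]"}

df = pd.read_json("test.json", dtype=dtype, lines=True, engine="pyarrow", dtype_backend="pyarrow")

# The requested dtype is ignored: the column comes back as int64.
print(df.dtypes.to_dict())  # prints {'Col1': int64[pyarrow]}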

Issue Description

The call to pyarrow_json.read_json inside pd.read_json doesn't pass any ParseOptions. As a result, the column types are always left to the pyarrow JSON parser's type inference, even when a dtype is explicitly passed to pd.read_json.

Expected Behavior

read_json's pyarrow parser should respect the requested dtype.

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.12.3.final.0
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 5, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252
pandas : 2.2.2
numpy : 2.0.1
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 72.1.0
pip : None
Cython : None
pytest : 8.3.2
hypothesis : None
sphinx : 7.4.7
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.26.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.6.1
gcsfs : None
matplotlib : 3.9.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.0
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

rhshadrach commented 2 months ago

Thanks for the report, further investigations and PRs to fix are welcome!

jahn96 commented 2 months ago

take

KevsterAmp commented 1 month ago

@jahn96 are you still working on this?

jahn96 commented 1 month ago

> @jahn96 are you still working on this?

Yes! Just got back from the break and raised a PR