Open albertvillanova opened 5 months ago
Note that the behavior described above is different if the format is JSON-Lines and the "pyarrow" engine is used:
json_lines = b'{"col1": 1, "col2": 1.0}\n{"col1": 2, "col2": 2.0}'
df = pd.read_json(io.BytesIO(json_lines), lines=True, engine="pyarrow")
assert not (df["col1"].dtype == df["col2"].dtype)
On the other hand, the downcasting appears again if the "ujson" engine (the default one) is used:
json_lines = b'{"col1": 1, "col2": 1.0}\n{"col1": 2, "col2": 2.0}'
df = pd.read_json(io.BytesIO(json_lines), lines=True)
assert df["col1"].dtype == df["col2"].dtype
That's a good point.
Also note that this downcasting is not performed by pandas.read_csv:
csv_content = "col1,col2\n1,1.0\n2,2.0"
df = pd.read_csv(io.StringIO(csv_content))
assert not (df["col1"].dtype == df["col2"].dtype)
Additionally, a str column is also cast to int:
d = [{"col1": 1, "col2": 1.0, "col3": "1"}, {"col1": 2, "col2": 2.0, "col3": "2"}]
df = pd.read_json(io.StringIO(json.dumps(d)))
assert df["col1"].dtype == df["col2"].dtype == df["col3"].dtype
Passing dtype=False, I get the behavior the OP expects. But the docstring doesn't seem clear to me:
If True, infer dtypes; if a dict of column to dtype, then use those; if False, then don’t infer dtypes at all, applies only to the data.
Perhaps the language can be improved.
@albertvillanova - can you confirm if dtype=False satisfies your use-case? Labeling this as just a docs issue for now.
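To make the docstring's options concrete, here is a minimal sketch that compares dtype=False with a dict of column to dtype. It assumes the default dtype_backend and a pandas version that behaves as reported in this thread; the variable names are illustrative only, and it does not cover the dtype_backend="pyarrow" case.

```python
import io
import json

import pandas as pd

records = [{"col1": 1, "col2": 1.0}, {"col1": 2, "col2": 2.0}]
payload = json.dumps(records)

# dtype=False: ask read_json not to infer dtypes from the data.
df_no_infer = pd.read_json(io.StringIO(payload), dtype=False)

# dtype as a dict: request a specific dtype for the named column only.
df_explicit = pd.read_json(io.StringIO(payload), dtype={"col2": "float64"})

# Inspect the resulting column dtypes for each call.
print(df_no_infer.dtypes)
print(df_explicit.dtypes)
```

The dict form may be the more targeted option when only some columns need to be pinned, since the remaining columns still go through the usual inference.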
@rhshadrach thanks for your reply.
Unfortunately, passing dtype=False does not satisfy my use case, because I was in fact also passing dtype_backend="pyarrow" (I did not mention it in the description to keep things simple).
The float-to-int downcasting therefore persists even with dtype=False when dtype_backend="pyarrow" is passed:
d = [{"col1": 1, "col2": 1.0}, {"col1": 2, "col2": 2.0}]
df = pd.read_json(io.StringIO(json.dumps(d)), dtype_backend="pyarrow")
assert df["col1"].dtype == df["col2"].dtype
df = pd.read_json(io.StringIO(json.dumps(d)), dtype_backend="pyarrow", dtype=False)
assert df["col1"].dtype == df["col2"].dtype
Additionally, I would like to ask whether, in the former case (when not passing dtype_backend="pyarrow"), passing dtype=False would have other side effects. Would other dtypes be treated differently?
Thanks @albertvillanova - I've reclassified this issue.
Interesting that BytesIO works while StringIO doesn't. I'm a first-time contributor, not sure if I'm able to solve the bug, but would like to take a look at this issue.
take
@albertvillanova @rhshadrach
I ran more experiments on this issue. With dtype=False and the default dtype_backend, the dtypes seem to come out as expected. Would this help with the use case?
data = [{"col1": 1, "col2": 1.0, "col3": "1"}, {"col1": 2, "col2": 2.0, "col3": "2"}]
df = pd.read_json(io.StringIO(json.dumps(data)), dtype=False)
assert df["col1"].dtype != df["col2"].dtype
assert df["col2"].dtype != df["col3"].dtype
assert df["col1"].dtype != df["col3"].dtype
# df["col1"].dtype == dtype('int64')
# df["col2"].dtype == dtype('float64')
# df["col3"].dtype == dtype('O')
As commented above, I need to pass dtype_backend="pyarrow".
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Although the JSON values are explicitly floats, the corresponding column gets an int dtype.
Expected Behavior
If the JSON contains float values, we would expect the corresponding column dtype to be float as well.
At least, we should be able to avoid this casting if needed.
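For reference, the int/float distinction is present in the JSON text itself. The following sketch uses only the standard-library json module (not the parser pandas uses) on the JSON-Lines payload from this thread, to show which Python type each value decodes to before any dtype inference takes place.

```python
import json

json_lines = b'{"col1": 1, "col2": 1.0}\n{"col1": 2, "col2": 2.0}'

# Decode each JSON-Lines record independently.
rows = [json.loads(line) for line in json_lines.splitlines()]

# Record the Python type that each value of the first record decodes to.
value_types = {key: type(value).__name__ for key, value in rows[0].items()}
print(value_types)  # {'col1': 'int', 'col2': 'float'}
```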
Installed Versions