pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.51k stars 17.88k forks source link

BUG: json_normalize does not parse nested lists consistently #53126

Open marickmanrho opened 1 year ago

marickmanrho commented 1 year ago

Pandas version checks

Reproducible Example

import pandas as pd

nested_array_data = {
    "B": {"A": [[1,2],[3,4]]}
}

# No record path
df1 = pd.json_normalize(nested_array_data)
print('df1:\n', df1.head())

# Partial record path
try:
    df2 = pd.json_normalize(nested_array_data, record_path=["B"])
except TypeError as e:
    print('df2:\n', f"TypeError: {e}")

# Full record path
df3 = pd.json_normalize(nested_array_data, record_path=['B', 'A'])
print('df3:\n', df3.head())

Issue Description

Depending on the record_path supplied to json_normalize you get different results. This is unexpected as, in this example, the record_path should only change the name of the columns and not the rows itself. Furthermore, it throws an error in one case which is not supposed to throw an error based on the documentation.

From the example above this is the output:

# No record path
df1 = pd.json_normalize(nested_array_data)
B.A
0 [[1, 2], [3, 4]]
# Partial record path
df2 = pd.json_normalize(nested_array_data, record_path=["B"])

TypeError: {'B': {'A': [[1, 2], [3, 4]]}} has non list value {'A': [[1, 2], [3, 4]]} for path B. Must be list or null.

# Full record path
df3 = pd.json_normalize(nested_array_data, record_path=['B', 'A'])
0 1
0 1 2
1 3 4

Expected Behavior

As far as I can tell, there are two options for expected behavior.

  1. Always expand lists when they are encountered

    pd.json_normalize(nested_array_data, record_path=['B'])
    A.0.0 A.0.1 A.1.0 A.1.1
    0 1 2 3 4
  2. Lists are not expended when encountered

    pd.json_normalize(nested_array_data, record_path=['B'])
    A
    0 [[1,2],[3,4]]

A toggle could be implemented to switch between behaviors as proposed by #42311. One could also implement a more fine grained control over list expansion like proposed in #27241. In my opinion it is best to not expand list data by default as this leads to the least complications for most input data.

Installed Versions

INSTALLED VERSIONS ------------------ commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7 python : 3.10.8.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19045 machine : AMD64 processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : Dutch_Netherlands.1252 pandas : 1.5.2 numpy : 1.23.4 pytz : 2022.1 dateutil : 2.8.2 setuptools : 65.5.0 pip : 22.3.1 Cython : None pytest : 7.1.2 hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : 8.13.2 pandas_datareader: None bs4 : None bottleneck : 1.3.5 brotli : None fastparquet : None fsspec : None gcsfs : None matplotlib : 3.6.2 numba : None numexpr : 2.8.4 odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : 1.9.3 snappy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None zstandard : None tzdata : None
topper-123 commented 1 year ago

Thanks for the report @marickmanrho.

Yeah, this is bug. We don't expand lists here as far as I know, so solution 2 would be the correct one. I've not used json_normalize all that much, so if that's wrong, I welcome a correction.

A toggle to switch behavior is an enhancement proposal, can you keep that separate from this bug, to keep things clean.

A PR welcome.

marickmanrho commented 1 year ago

Alright @topper-123, I'll prepare a PR with a fix later this week.

marickmanrho commented 1 year ago

I find it hard to make a small edit in the current implementation of json_normalize to fix this bug. I also foresee difficulties in fixing feature request later on. Would a major rewrite of json_normalize be acceptable @topper-123?

topper-123 commented 1 year ago

Can you outline what you're intending to do?