DavidNaizheZhou commented 2 weeks ago

Pandas version checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

data = {
    "level1": {
        "rows": [
            {"col1": 1, "col2": 2},
        ]
    },
    "meta1": {
        "meta_sub1": 1,
    }, 
}

df = pd.json_normalize(data, record_path=["level1", "rows"], meta=["meta1"])
print(df)

df = pd.json_normalize(
    data,
    record_path=["level1", "rows"],
    meta=[["meta1", "meta_sub1"]],  # Trying to access sub-fields within meta1
)

Issue Description

Description of the Issue

This reproducible example demonstrates the challenges and potential pitfalls when using pandas.json_normalize() to extract and flatten hierarchical data structures with nested metadata:

Data Structure

The data dictionary is multi-layered, with nested dictionaries and a list of dictionaries (rows) under level1. Additionally, meta1 is structured as a dictionary containing subfields.

Successful Normalization

The first call to pd.json_normalize() extracts the data from rows under level1 and includesmeta1as a top-level metadata field. This works as intended becausemeta1 is accessed directly as a single key.

Output:

   col1  col2             meta1
0     1     2  {'meta_sub1': 1}

KeyError with Nested Meta Fields

The second pd.json_normalize() call attempts to extract subfields from meta1 using a nested path (meta=[["meta1", "meta_sub1"]]). This results in a KeyError because json_normalize() does not natively support nested lists for specifying paths within the meta parameter.

Expected Behavior

df = pd.json_normalize(
    data,
    record_path=["level1", "rows"],
    meta=[["meta1", "meta_sub1"]],  # Trying to access sub-fields within meta1
)

   col1  col2  meta1
0     1     2     1

Installed Versions

INSTALLED VERSIONS ------------------ commit : 0691c5cf90477d3503834d983f69350f250a6ff7 python : 3.12.1 python-bits : 64 OS : Windows OS-release : 11 Version : 10.0.22631 machine : AMD64 processor : Intel64 Family 6 Model 186 Stepping 2, GenuineIntel byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : de_DE.cp1252 pandas : 2.2.3 numpy : 1.26.2 pytz : 2024.1 dateutil : 2.8.2 pip : 24.3.1 Cython : None sphinx : 8.1.3 IPython : 8.17.2 adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : 4.12.3 blosc : None bottleneck : None dataframe-api-compat : None fastparquet : None fsspec : 2024.9.0 html5lib : None hypothesis : None gcsfs : None jinja2 : 3.1.3 lxml.etree : 5.2.2 matplotlib : 3.8.3 numba : None numexpr : None odfpy : None openpyxl : 3.1.2 pandas_gbq : None psycopg2 : None pymysql : None pyarrow : 15.0.0 pyreadstat : None pytest : 8.1.1 python-calamine : None pyxlsb : 1.0.10 s3fs : None scipy : 1.11.4 sqlalchemy : 2.0.28 tables : None tabulate : 0.9.0 xarray : None xlrd : 2.0.1 xlsxwriter : 3.2.0 zstandard : None tzdata : 2024.1 qtpy : None pyqt5 : None

rhshadrach commented 2 weeks ago

Thanks for the report. Is this the same as #59233?

DavidNaizheZhou commented 2 weeks ago

Yes, it is. Missed that, sorry about that.

DavidNaizheZhou commented 2 weeks ago

Found a solution that resolved it for "me". The change involves modifying the _recursive_extract function in _normalize.py. Below is the specific change I made:

    def _recursive_extract(data, path, seen_meta, level: int = 0, root_obj=None) -> None:
        if isinstance(data, dict):
            data = [data]
        if len(path) > 1:
            for obj in data:
                if root_obj is None:
                    root_obj = obj
                for val, key in zip(_meta, meta_keys):
                    if level + 1 == len(val):
                        seen_meta[key] = _pull_field(root_obj, val)

                _recursive_extract(obj[path[0]], path[1:], seen_meta, level=level + 1, root_obj=root_obj)
        else:
            for obj in data:
                recs = _pull_records(obj, path[0])
                recs = [nested_to_record(r, sep=sep, max_level=max_level) if isinstance(r, dict) else r for r in recs]

                # For repeating the metadata later
                lengths.append(len(recs))
                for val, key in zip(_meta, meta_keys):
                    if level + 1 > len(val):
                        meta_val = seen_meta[key]
                    else:
                        meta_val = _pull_field(root_obj, val)
                    meta_vals[key].append(meta_val)
                records.extend(recs)

rhshadrach commented 2 weeks ago

This results in a KeyError because json_normalize() does not natively support nested lists for specifying paths within the meta parameter.

It does support nested lists, however it seems to make the assumption that all except the last element agree with record_path. E.g.

data = {
    "level1": [
        {
            "rows": [{"col1": 1, "col2": 2}, {"col1": 3, "col2": 4}],
            "meta1": 1,
        },
        {
            "rows": [{"col1": 5, "col2": 6}, {"col1": 7, "col2": 8}],
            "meta1": 2,
        },
    ],
}
df = pd.json_normalize(data, record_path=["level1", "rows"], meta=[["level1", "meta1"]])
print(df)
#    col1  col2 level1.meta1
# 0     1     2            1
# 1     3     4            1
# 2     5     6            2
# 3     7     8            2

While I'm not very familiar with this functionality, I believe the intention is to have metadata that sits alongside each collection of records.

If the metadata does not sit alongside each collection of records, then I think the result would necessarily be a constant column. Is that your desire @DavidNaizheZhou?

pandas-dev / pandas

BUG: Challenges with Nested Metadata Extraction Using pandas.json_normalize( #60254

Pandas version checks

Reproducible Example

Issue Description

Description of the Issue

Data Structure

Successful Normalization

Output:

KeyError with Nested Meta Fields

Expected Behavior

Installed Versions