Open DavidNaizheZhou opened 2 weeks ago
Thanks for the report. Is this the same as #59233?
Yes, it is. Missed that, sorry about that.
I found a change that resolves it for me. It modifies the _recursive_extract function in _normalize.py. Below is the specific change I made:
    def _recursive_extract(data, path, seen_meta, level: int = 0, root_obj=None) -> None:
        if isinstance(data, dict):
            data = [data]
        if len(path) > 1:
            for obj in data:
                if root_obj is None:
                    root_obj = obj
                for val, key in zip(_meta, meta_keys):
                    if level + 1 == len(val):
                        seen_meta[key] = _pull_field(root_obj, val)
                _recursive_extract(obj[path[0]], path[1:], seen_meta, level=level + 1, root_obj=root_obj)
        else:
            for obj in data:
                recs = _pull_records(obj, path[0])
                recs = [nested_to_record(r, sep=sep, max_level=max_level) if isinstance(r, dict) else r for r in recs]
                # For repeating the metadata later
                lengths.append(len(recs))
                for val, key in zip(_meta, meta_keys):
                    if level + 1 > len(val):
                        meta_val = seen_meta[key]
                    else:
                        meta_val = _pull_field(root_obj, val)
                    meta_vals[key].append(meta_val)
                records.extend(recs)
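As far as I can tell from the change above, the main difference is that each meta path is looked up from the root object using the full path, instead of relative to the current nesting level. A minimal sketch of that kind of lookup, assuming a hypothetical helper `pull_from_root` and made-up data (neither is part of pandas):

```python
def pull_from_root(root, path):
    """Follow a list of keys down from the root object (hypothetical helper)."""
    result = root
    for key in path:
        result = result[key]
    return result


# Made-up data, for illustration only
root = {"meta1": {"meta_sub1": "a"}, "level1": {"rows": [{"col1": 1}]}}

# The full meta path is resolved from the root, independent of record_path
value = pull_from_root(root, ["meta1", "meta_sub1"])
print(value)
```

Note that a lookup like this only works while every step along the path is a dict; it would fail if one of the intermediate values were a list.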
This results in a KeyError because json_normalize() does not natively support nested lists for specifying paths within the meta parameter.
It does support nested lists; however, it seems to assume that all elements except the last agree with record_path. E.g.
    import pandas as pd

    data = {
        "level1": [
            {
                "rows": [{"col1": 1, "col2": 2}, {"col1": 3, "col2": 4}],
                "meta1": 1,
            },
            {
                "rows": [{"col1": 5, "col2": 6}, {"col1": 7, "col2": 8}],
                "meta1": 2,
            },
        ],
    }
    df = pd.json_normalize(data, record_path=["level1", "rows"], meta=[["level1", "meta1"]])
    print(df)
    #    col1  col2 level1.meta1
    # 0     1     2            1
    # 1     3     4            1
    # 2     5     6            2
    # 3     7     8            2
While I'm not very familiar with this functionality, I believe the intention is to have metadata that sits alongside each collection of records.
If the metadata does not sit alongside each collection of records, then I think the result would necessarily be a constant column. Is that your desire @DavidNaizheZhou?
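To illustrate the constant-column case, here is a minimal sketch with made-up data in which the metadata (a hypothetical key `meta_top`) sits at the top level rather than alongside each collection of records. Since there is only one such value, it is repeated for every row:

```python
import pandas as pd

# Made-up data: "meta_top" is a hypothetical top-level key, not from the report
data = {
    "meta_top": "a",
    "level1": [
        {"rows": [{"col1": 1, "col2": 2}, {"col1": 3, "col2": 4}]},
        {"rows": [{"col1": 5, "col2": 6}, {"col1": 7, "col2": 8}]},
    ],
}

# The single top-level value is broadcast across all extracted records
df = pd.json_normalize(data, record_path=["level1", "rows"], meta=["meta_top"])
print(df)
```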
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Description of the Issue
This reproducible example demonstrates the challenges and potential pitfalls when using pandas.json_normalize() to extract and flatten hierarchical data structures with nested metadata:
Data Structure
The data dictionary is multi-layered, with nested dictionaries and a list of dictionaries (rows) under level1. Additionally, meta1 is structured as a dictionary containing subfields.
Successful Normalization
The first call to pd.json_normalize() extracts the data from rows under level1 and includes meta1 as a top-level metadata field. This works as intended because meta1 is accessed directly as a single key.
Output:
KeyError with Nested Meta Fields
The second pd.json_normalize() call attempts to extract subfields from meta1 using a nested path (meta=[["meta1", "meta_sub1"]]). This results in a KeyError because json_normalize() does not natively support nested lists for specifying paths within the meta parameter.
Expected Behavior
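A minimal sketch of one possible workaround, assuming made-up data shaped roughly like the description above (the exact layout and the value "a" are assumptions for illustration): normalize the records without meta, then attach the nested subfield by hand.

```python
import pandas as pd

# Made-up data; the placement of "meta1" at the top level is an assumption
data = {
    "level1": {"rows": [{"col1": 1, "col2": 2}, {"col1": 3, "col2": 4}]},
    "meta1": {"meta_sub1": "a"},
}

# Flatten only the records, without passing meta
flat = pd.json_normalize(data, record_path=["level1", "rows"])

# Attach the nested subfield manually; the scalar is broadcast to every row
flat["meta1.meta_sub1"] = data["meta1"]["meta_sub1"]
print(flat)
```

This sidesteps the meta path handling entirely, so it says nothing about what json_normalize() itself should do with a nested meta path.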
Installed Versions