pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.3k stars 17.8k forks source link

json_normalize does not handle nested meta paths when also using a nested record_path #27220

Closed connormcmk closed 4 years ago

connormcmk commented 5 years ago

Code Sample

from pandas.io.json import json_normalize

data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {
              'governor': 'Rick Scott'
         },
         'counties': [{'name': 'Dade', 'population': 12345},
                     {'name': 'Broward', 'population': 40000},
                     {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {
              'governor': 'John Kasich'
         },
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]

print(json_normalize(data, ['counties', 'name'], ['state', 'shortname', ['info', 'governor']], errors='ignore'))

Returns

    0    state shortname  info.governor
0   D  Florida        FL            NaN
1   a  Florida        FL            NaN
2   d  Florida        FL            NaN
3   e  Florida        FL            NaN
4   B  Florida        FL            NaN
5   r  Florida        FL            NaN
6   o  Florida        FL            NaN
7   w  Florida        FL            NaN
8   a  Florida        FL            NaN
9   r  Florida        FL            NaN
10  d  Florida        FL            NaN
11  P  Florida        FL            NaN
12  a  Florida        FL            NaN
13  l  Florida        FL            NaN
14  m  Florida        FL            NaN
15     Florida        FL            NaN
16  B  Florida        FL            NaN
17  e  Florida        FL            NaN
18  a  Florida        FL            NaN

Problem description

Running json_normalize on a nested record_path with a nested meta argument gives an error that says it cannot find info.governer.

This is inconsistent because running it again, without the nested meta argument, successfully returns the data:

# using the same data from before
print(json_normalize(data, ['counties', 'name'], ['state', 'shortname', 'info'], errors='ignore'))
    0    state shortname                         info
0   D  Florida        FL   {'governor': 'Rick Scott'}
1   a  Florida        FL   {'governor': 'Rick Scott'}
2   d  Florida        FL   {'governor': 'Rick Scott'}
3   e  Florida        FL   {'governor': 'Rick Scott'}
4   B  Florida        FL   {'governor': 'Rick Scott'}
5   r  Florida        FL   {'governor': 'Rick Scott'}
6   o  Florida        FL   {'governor': 'Rick Scott'}
7   w  Florida        FL   {'governor': 'Rick Scott'}

Similarly, using a non-nested record path also works (in fact, this is the exact sample example that can be found in the json_normalize pandas documentation).

# using the same data from before
print(json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']]))
         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott
3      Summit        1234     Ohio        OH   John Kasich
4    Cuyahoga        1337     Ohio        OH   John Kasich

The result is that it is much more difficult to access nested meta data.

May be related to: https://github.com/pandas-dev/pandas/issues/21537

Expected Output

    0    state shortname  info.governor
0   D  Florida        FL            Rick Scott
1   a  Florida        FL            Rick Scott
2   d  Florida        FL            Rick Scott
3   e  Florida        FL            Rick Scott
4   B  Florida        FL            Rick Scott
5   r  Florida        FL            Rick Scott
6   o  Florida        FL            Rick Scott
7   w  Florida        FL            Rick Scott
8   a  Florida        FL            Rick Scott
9   r  Florida        FL            Rick Scott
10  d  Florida        FL            Rick Scott
11  P  Florida        FL            Rick Scott
12  a  Florida        FL            Rick Scott
13  l  Florida        FL            Rick Scott
14  m  Florida        FL            Rick Scott
15     Florida        FL            Rick Scott
16  B  Florida        FL            Rick Scott
17  e  Florida        FL            Rick Scott
18  a  Florida        FL            Rick Scott

Note that ['counties', 'name'] is an arbitrary list of strings to use as a record path, and that this example is contrived (who really needs a table comprised of each letter of a string?). However, many real scenarios can be constructed that require this sort of nested record_path extraction along with nested meta path extraction.

Output of pd.show_versions()

[paste the output of ``pd.show_versions()`` here below this line] INSTALLED VERSIONS ------------------ commit: None python: 3.7.3.final.0 python-bits: 64 OS: Linux OS-release: 4.9.125-linuxkit machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 pandas: 0.24.2 pytest: None pip: 19.1 setuptools: 41.0.1 Cython: 0.29.7 numpy: 1.15.4 scipy: 1.2.1 pyarrow: None xarray: None IPython: 7.5.0 sphinx: None patsy: 0.5.1 dateutil: 2.8.0 pytz: 2019.1 blosc: None bottleneck: None tables: None numexpr: 2.6.9 feather: None matplotlib: 3.0.3 openpyxl: None xlrd: 1.2.0 xlwt: None xlsxwriter: None lxml.etree: None bs4: 4.7.1 html5lib: None sqlalchemy: 1.3.3 pymysql: None psycopg2: None jinja2: 2.10.1 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None gcsfs: None
WillAyd commented 5 years ago

I think makes sense so as long as the nested meta path points to an object and not an array. Would welcome a PR

cc @bhavaniravi

another-green commented 5 years ago

I am working with JSON a lot recently. I will give a try.

LTe commented 4 years ago

Test proposed by @yanglinlee https://github.com/pandas-dev/pandas/pull/27667/files#diff-490b1195dddd7f1523a30f865801ed17R291-R313

def test_nested_meta_path_with_nested_record_path(self, state_data):
    # GH 27220
    result = json_normalize(
        data=state_data,
        record_path=["counties", "name"],
        meta=["state", "shortname", ["info", "governor"]],
        errors="ignore",
    )

    ex_data = [
        [
            i
            for word in ["Dade", "Broward", "Palm Beach", "Summit", "Cuyahoga"]
            for i in word
        ],
        ["Florida"] * 21 + ["Ohio"] * 14,
        ["FL"] * 21 + ["OH"] * 14,
        ["Rick Scott"] * 21 + ["John Kasich"] * 14,
    ]
    expected = DataFrame(ex_data).T
    expected.columns = [0, "state", "shortname", "info.governor"]

    tm.assert_frame_equal(result, expected)

After some changes to work with new panda version https://github.com/LTe/pandas/tree/json_improve

diff --git a/doc/source/whatsnew/v1.1.0.rst b/doc/source/whatsnew/v1.1.0.rst
index 2a641a37b..af62031fa 100644
--- a/doc/source/whatsnew/v1.1.0.rst
+++ b/doc/source/whatsnew/v1.1.0.rst
@@ -576,6 +576,7 @@ I/O
 - Bug in :meth:`read_excel` did not correctly handle multiple embedded spaces in OpenDocument text cells. (:issue:`32207`)
 - Bug in :meth:`read_json` was raising ``TypeError`` when reading a list of booleans into a Series. (:issue:`31464`)
 - Bug in :func:`pandas.io.json.json_normalize` where location specified by `record_path` doesn't point to an array. (:issue:`26284`)
+- Bug in :meth:`pandas.io.json.json_normalize` when nested meta paths with a nested record path. (:issue:`27220`)

 Plotting
 ^^^^^^^^
diff --git a/pandas/tests/io/json/test_normalize.py b/pandas/tests/io/json/test_normalize.py
index 4a32f3809..6624b42b1 100644
--- a/pandas/tests/io/json/test_normalize.py
+++ b/pandas/tests/io/json/test_normalize.py
@@ -284,6 +284,27 @@ class TestJSONNormalize:
         expected = DataFrame(ex_data, columns=result.columns)
         tm.assert_frame_equal(result, expected)

+    def test_nested_meta_path_with_nested_record_path(self, state_data):
+        # GH 27220
+        result = json_normalize(
+            data=state_data,
+            record_path=["counties"],
+            meta=["state", "shortname", ["info", "governor"]],
+            errors="ignore",
+        )
+        ex_data = {
+            "name": ["Dade", "Broward", "Palm Beach", "Summit", "Cuyahoga"],
+            "population": [12345, 40000, 60000, 1234, 1337],
+            "state": ["Florida"] * 3 + ["Ohio"] * 2,
+            "shortname": ["FL"] * 3 + ["OH"] * 2,
+            "info.governor": ["Rick Scott"] * 3 + ["John Kasich"] * 2,
+        }
+        expected = DataFrame(
+            ex_data,
+            columns=["name", "population", "state", "shortname", "info.governor"],
+        )
+        tm.assert_frame_equal(result, expected)
+
     def test_meta_name_conflict(self):
         data = [
             {

is working just fine with current master. CI run: https://dev.azure.com/piotrnielacny/piotrnielacny/_build/results?buildId=9&view=results

Even there is almost the same test right now here: https://github.com/pandas-dev/pandas/blob/cd5250211b3c79ec41da0f2a1ce0f4bfa3d1f3e6/pandas/tests/io/json/test_normalize.py#L251-L285

def test_shallow_nested(self):
    data = [
        {
            "state": "Florida",
            "shortname": "FL",
            "info": {"governor": "Rick Scott"},
            "counties": [
                {"name": "Dade", "population": 12345},
                {"name": "Broward", "population": 40000},
                {"name": "Palm Beach", "population": 60000},
            ],
        },
        {
            "state": "Ohio",
            "shortname": "OH",
            "info": {"governor": "John Kasich"},
            "counties": [
                {"name": "Summit", "population": 1234},
                {"name": "Cuyahoga", "population": 1337},
            ],
        },
    ]

    result = json_normalize(
        data, "counties", ["state", "shortname", ["info", "governor"]]
    )
    ex_data = {
        "name": ["Dade", "Broward", "Palm Beach", "Summit", "Cuyahoga"],
        "state": ["Florida"] * 3 + ["Ohio"] * 2,
        "shortname": ["FL", "FL", "FL", "OH", "OH"],
        "info.governor": ["Rick Scott"] * 3 + ["John Kasich"] * 2,
        "population": [12345, 40000, 60000, 1234, 1337],
    }
    expected = DataFrame(ex_data, columns=result.columns)
    tm.assert_frame_equal(result, expected)

I think this issue is fixed in master and we can close it.

LTe commented 4 years ago

cc @WillAyd @jreback

mroeschke commented 4 years ago

Happy to add that test to the test suite as a regression test

MarcoGorelli commented 4 years ago

I think this issue is fixed in master and we can close it.

I'm probably missing something (if so, no need to reply, sorry for the disruption) but isn't the original issue still not working? If I run

from pandas import json_normalize

data = [
    {
        "state": "Florida",
        "shortname": "FL",
        "info": {"governor": "Rick Scott"},
        "counties": [
            {"name": "Dade", "population": 12345},
            {"name": "Broward", "population": 40000},
            {"name": "Palm Beach", "population": 60000},
        ],
    },
    {
        "state": "Ohio",
        "shortname": "OH",
        "info": {"governor": "John Kasich"},
        "counties": [
            {"name": "Summit", "population": 1234},
            {"name": "Cuyahoga", "population": 1337},
        ],
    },
]

print(
    json_normalize(
        data,
        ["counties", "name"],
        ["state", "shortname", ["info", "governor"]],
        errors="ignore",
    )
)

on master then I get

TypeError: {'name': 'Dade', 'population': 12345} has non list value Dade for path name. Must be list or null.

It seems to me that the proposed regression test is for the case

print(json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']]))

which, according to the OP, was already working

LTe commented 4 years ago

@MarcoGorelli TypeError is expected, because string is not treated as a list. But I think regression test is already there:

https://github.com/pandas-dev/pandas/blob/cd5250211b3c79ec41da0f2a1ce0f4bfa3d1f3e6/pandas/tests/io/json/test_normalize.py#L223

seidnerj commented 3 days ago

Hey all,

Not sure if this is a regression or a different issue, but I'm seeing this issue with a slight variation. The following works fine:

data = [
        {
            "state": "Florida",
            "shortname": "FL",
            "info": {"governor": 
                    "Rick Scott"},
            "counties": {
                "nodes": [
                    {"name": "Dade", "population": 12345},
                    {"name": "Broward", "population": 40000},
                    {"name": "Palm Beach", "population": 60000},
                ],
                "__typename": "County"
            }
        }
    ]

pd.json_normalize(data, record_path=["counties", "nodes"], meta=["state", "shortname", ["info"]])
         name  population    state shortname                        info
0        Dade       12345  Florida        FL  {'governor': 'Rick Scott'}
1     Broward       40000  Florida        FL  {'governor': 'Rick Scott'}
2  Palm Beach       60000  Florida        FL  {'governor': 'Rick Scott'}

However, the following does not:

pd.json_normalize(data, record_path=["counties", "nodes"], meta=["state", "shortname", ["info", "governor"]])
Traceback (most recent call last):
  File "/Users/user/Applications/PyCharm.app/Contents/plugins/python/helpers-pro/pydevd_asyncio/pydevd_asyncio_utils.py", line 117, in _exec_async_code
    result = func()
             ^^^^^^
  File "<input>", line 1, in <module>
  File "/Users/user/Documents/Code/.venv/lib/python3.11/site-packages/pandas/io/json/_normalize.py", line 517, in json_normalize
    _recursive_extract(data, record_path, {}, level=0)
  File "/Users/user/Documents/Code/.venv/lib/python3.11/site-packages/pandas/io/json/_normalize.py", line 496, in _recursive_extract
    _recursive_extract(obj[path[0]], path[1:], seen_meta, level=level + 1)
  File "/Users/user/Documents/Code/.venv/lib/python3.11/site-packages/pandas/io/json/_normalize.py", line 513, in _recursive_extract
    meta_val = _pull_field(obj, val[level:])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Documents/Code/.venv/lib/python3.11/site-packages/pandas/io/json/_normalize.py", line 408, in _pull_field
    raise KeyError(
KeyError: "Key 'governor' not found. To replace missing values of 'governor' with np.nan, pass in errors='ignore'"

Dropping the "nodes" key again gets this working:

data = [
        {
            "state": "Florida",
            "shortname": "FL",
            "info": {"governor": 
                    "Rick Scott"},
            "counties": [
                    {"name": "Dade", "population": 12345},
                    {"name": "Broward", "population": 40000},
                    {"name": "Palm Beach", "population": 60000}
                ]
        }
    ]

pd.json_normalize(data, record_path=["counties"], meta=["state", "shortname", ["info", "governor"]])
         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott

These are the versions I'm using (dropped all "None" for brevity)

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit                : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python                : 3.11.8.final.0
python-bits           : 64
OS                    : Darwin
OS-release            : 24.0.0
Version               : Darwin Kernel Version 24.0.0: Mon Aug 12 20:51:54 PDT 2024; root:xnu-11215.1.10~2/RELEASE_ARM64_T6000
machine               : arm64
processor             : arm
byteorder             : little
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8
pandas                : 2.2.1
numpy                 : 1.26.4
pytz                  : 2024.2
dateutil              : 2.9.0.post0
setuptools            : 74.1.2
pip                   : 24.2
html5lib              : 1.1
jinja2                : 3.1.4
tzdata                : 2024.1