Closed connormcmk closed 4 years ago
I think makes sense so as long as the nested meta path points to an object and not an array. Would welcome a PR
cc @bhavaniravi
I am working with JSON a lot recently. I will give a try.
Test proposed by @yanglinlee https://github.com/pandas-dev/pandas/pull/27667/files#diff-490b1195dddd7f1523a30f865801ed17R291-R313
def test_nested_meta_path_with_nested_record_path(self, state_data):
# GH 27220
result = json_normalize(
data=state_data,
record_path=["counties", "name"],
meta=["state", "shortname", ["info", "governor"]],
errors="ignore",
)
ex_data = [
[
i
for word in ["Dade", "Broward", "Palm Beach", "Summit", "Cuyahoga"]
for i in word
],
["Florida"] * 21 + ["Ohio"] * 14,
["FL"] * 21 + ["OH"] * 14,
["Rick Scott"] * 21 + ["John Kasich"] * 14,
]
expected = DataFrame(ex_data).T
expected.columns = [0, "state", "shortname", "info.governor"]
tm.assert_frame_equal(result, expected)
After some changes to work with new panda version https://github.com/LTe/pandas/tree/json_improve
diff --git a/doc/source/whatsnew/v1.1.0.rst b/doc/source/whatsnew/v1.1.0.rst
index 2a641a37b..af62031fa 100644
--- a/doc/source/whatsnew/v1.1.0.rst
+++ b/doc/source/whatsnew/v1.1.0.rst
@@ -576,6 +576,7 @@ I/O
- Bug in :meth:`read_excel` did not correctly handle multiple embedded spaces in OpenDocument text cells. (:issue:`32207`)
- Bug in :meth:`read_json` was raising ``TypeError`` when reading a list of booleans into a Series. (:issue:`31464`)
- Bug in :func:`pandas.io.json.json_normalize` where location specified by `record_path` doesn't point to an array. (:issue:`26284`)
+- Bug in :meth:`pandas.io.json.json_normalize` when nested meta paths with a nested record path. (:issue:`27220`)
Plotting
^^^^^^^^
diff --git a/pandas/tests/io/json/test_normalize.py b/pandas/tests/io/json/test_normalize.py
index 4a32f3809..6624b42b1 100644
--- a/pandas/tests/io/json/test_normalize.py
+++ b/pandas/tests/io/json/test_normalize.py
@@ -284,6 +284,27 @@ class TestJSONNormalize:
expected = DataFrame(ex_data, columns=result.columns)
tm.assert_frame_equal(result, expected)
+ def test_nested_meta_path_with_nested_record_path(self, state_data):
+ # GH 27220
+ result = json_normalize(
+ data=state_data,
+ record_path=["counties"],
+ meta=["state", "shortname", ["info", "governor"]],
+ errors="ignore",
+ )
+ ex_data = {
+ "name": ["Dade", "Broward", "Palm Beach", "Summit", "Cuyahoga"],
+ "population": [12345, 40000, 60000, 1234, 1337],
+ "state": ["Florida"] * 3 + ["Ohio"] * 2,
+ "shortname": ["FL"] * 3 + ["OH"] * 2,
+ "info.governor": ["Rick Scott"] * 3 + ["John Kasich"] * 2,
+ }
+ expected = DataFrame(
+ ex_data,
+ columns=["name", "population", "state", "shortname", "info.governor"],
+ )
+ tm.assert_frame_equal(result, expected)
+
def test_meta_name_conflict(self):
data = [
{
is working just fine with current master. CI run: https://dev.azure.com/piotrnielacny/piotrnielacny/_build/results?buildId=9&view=results
Even there is almost the same test right now here: https://github.com/pandas-dev/pandas/blob/cd5250211b3c79ec41da0f2a1ce0f4bfa3d1f3e6/pandas/tests/io/json/test_normalize.py#L251-L285
def test_shallow_nested(self):
data = [
{
"state": "Florida",
"shortname": "FL",
"info": {"governor": "Rick Scott"},
"counties": [
{"name": "Dade", "population": 12345},
{"name": "Broward", "population": 40000},
{"name": "Palm Beach", "population": 60000},
],
},
{
"state": "Ohio",
"shortname": "OH",
"info": {"governor": "John Kasich"},
"counties": [
{"name": "Summit", "population": 1234},
{"name": "Cuyahoga", "population": 1337},
],
},
]
result = json_normalize(
data, "counties", ["state", "shortname", ["info", "governor"]]
)
ex_data = {
"name": ["Dade", "Broward", "Palm Beach", "Summit", "Cuyahoga"],
"state": ["Florida"] * 3 + ["Ohio"] * 2,
"shortname": ["FL", "FL", "FL", "OH", "OH"],
"info.governor": ["Rick Scott"] * 3 + ["John Kasich"] * 2,
"population": [12345, 40000, 60000, 1234, 1337],
}
expected = DataFrame(ex_data, columns=result.columns)
tm.assert_frame_equal(result, expected)
I think this issue is fixed in master and we can close it.
cc @WillAyd @jreback
Happy to add that test to the test suite as a regression test
I think this issue is fixed in master and we can close it.
I'm probably missing something (if so, no need to reply, sorry for the disruption) but isn't the original issue still not working? If I run
from pandas import json_normalize
data = [
{
"state": "Florida",
"shortname": "FL",
"info": {"governor": "Rick Scott"},
"counties": [
{"name": "Dade", "population": 12345},
{"name": "Broward", "population": 40000},
{"name": "Palm Beach", "population": 60000},
],
},
{
"state": "Ohio",
"shortname": "OH",
"info": {"governor": "John Kasich"},
"counties": [
{"name": "Summit", "population": 1234},
{"name": "Cuyahoga", "population": 1337},
],
},
]
print(
json_normalize(
data,
["counties", "name"],
["state", "shortname", ["info", "governor"]],
errors="ignore",
)
)
on master then I get
TypeError: {'name': 'Dade', 'population': 12345} has non list value Dade for path name. Must be list or null.
It seems to me that the proposed regression test is for the case
print(json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']]))
which, according to the OP, was already working
@MarcoGorelli TypeError
is expected, because string
is not treated as a list. But I think regression test is already there:
Hey all,
Not sure if this is a regression or a different issue, but I'm seeing this issue with a slight variation. The following works fine:
data = [
{
"state": "Florida",
"shortname": "FL",
"info": {"governor":
"Rick Scott"},
"counties": {
"nodes": [
{"name": "Dade", "population": 12345},
{"name": "Broward", "population": 40000},
{"name": "Palm Beach", "population": 60000},
],
"__typename": "County"
}
}
]
pd.json_normalize(data, record_path=["counties", "nodes"], meta=["state", "shortname", ["info"]])
name population state shortname info
0 Dade 12345 Florida FL {'governor': 'Rick Scott'}
1 Broward 40000 Florida FL {'governor': 'Rick Scott'}
2 Palm Beach 60000 Florida FL {'governor': 'Rick Scott'}
However, the following does not:
pd.json_normalize(data, record_path=["counties", "nodes"], meta=["state", "shortname", ["info", "governor"]])
Traceback (most recent call last):
File "/Users/user/Applications/PyCharm.app/Contents/plugins/python/helpers-pro/pydevd_asyncio/pydevd_asyncio_utils.py", line 117, in _exec_async_code
result = func()
^^^^^^
File "<input>", line 1, in <module>
File "/Users/user/Documents/Code/.venv/lib/python3.11/site-packages/pandas/io/json/_normalize.py", line 517, in json_normalize
_recursive_extract(data, record_path, {}, level=0)
File "/Users/user/Documents/Code/.venv/lib/python3.11/site-packages/pandas/io/json/_normalize.py", line 496, in _recursive_extract
_recursive_extract(obj[path[0]], path[1:], seen_meta, level=level + 1)
File "/Users/user/Documents/Code/.venv/lib/python3.11/site-packages/pandas/io/json/_normalize.py", line 513, in _recursive_extract
meta_val = _pull_field(obj, val[level:])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/Documents/Code/.venv/lib/python3.11/site-packages/pandas/io/json/_normalize.py", line 408, in _pull_field
raise KeyError(
KeyError: "Key 'governor' not found. To replace missing values of 'governor' with np.nan, pass in errors='ignore'"
Dropping the "nodes" key again gets this working:
data = [
{
"state": "Florida",
"shortname": "FL",
"info": {"governor":
"Rick Scott"},
"counties": [
{"name": "Dade", "population": 12345},
{"name": "Broward", "population": 40000},
{"name": "Palm Beach", "population": 60000}
]
}
]
pd.json_normalize(data, record_path=["counties"], meta=["state", "shortname", ["info", "governor"]])
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
These are the versions I'm using (dropped all "None" for brevity)
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python : 3.11.8.final.0
python-bits : 64
OS : Darwin
OS-release : 24.0.0
Version : Darwin Kernel Version 24.0.0: Mon Aug 12 20:51:54 PDT 2024; root:xnu-11215.1.10~2/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.1
numpy : 1.26.4
pytz : 2024.2
dateutil : 2.9.0.post0
setuptools : 74.1.2
pip : 24.2
html5lib : 1.1
jinja2 : 3.1.4
tzdata : 2024.1
Code Sample
Returns
Problem description
Running json_normalize on a nested record_path with a nested meta argument gives an error that says it cannot find info.governer.
This is inconsistent because running it again, without the nested meta argument, successfully returns the data:
Similarly, using a non-nested record path also works (in fact, this is the exact sample example that can be found in the json_normalize pandas documentation).
The result is that it is much more difficult to access nested meta data.
May be related to: https://github.com/pandas-dev/pandas/issues/21537
Expected Output
Note that
['counties', 'name']
is an arbitrary list of strings to use as a record path, and that this example is contrived (who really needs a table comprised of each letter of a string?). However, many real scenarios can be constructed that require this sort of nested record_path extraction along with nested meta path extraction.Output of
pd.show_versions()