pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.87k stars 18.02k forks source link

QST: #58930

Open phan480 opened 5 months ago

phan480 commented 5 months ago

Research

Link to question on StackOverflow

Pandas json_normalize question

Question about pandas

I am trying to understand an issue I encounter with json_normalize. This may be a bug? The following code:

import pandas
data = [{
        "id": 1039835,
        "status": "on_track",
        "development_plan": {
            "id": 23858,
            "items": [{
                    "id": 176498,
                    "item_name": "itemname 1"
                }, {
                    "id": 172450,
                    "item_name": "itemname 2"
                }, {
                    "id": 172451,
                    "item_name": "itemname 3"
                }
            ]
        },
        "user": {
            "id": 714067
        }
    }
]
print(
    pandas.json_normalize(data,
                          ['development_plan', 'items'],
                          [['development_plan', 'id'],['status'],['user', 'id'],['user']],
                          errors = 'ignore'
    )
)

Returns this dataframe:

   id   item_name development_plan.id    status user.id            user

0 176498 itemname 1 23858 on_track 23858 {'id': 714067} 1 172450 itemname 2 23858 on_track 23858 {'id': 714067} 2 172451 itemname 3 23858 on_track 23858 {'id': 714067}

The question is why the user.id is wrong? It returns the development_plan.id. Adding the complete user dictionary does return the correct user.id though.

Is this a bug or am I missing something?

RamMarthi9 commented 5 months ago

Hi phan480,

Use this piece of code: I think this should help u. Ping me if this is working. import pandas as pd

data = [{ "id": 1039835, "status": "on_track", "development_plan": { "id": 23858, "items": [{ "id": 176498, "item_name": "itemname 1" }, { "id": 172450, "item_name": "itemname 2" }, { "id": 172451, "item_name": "itemname 3" } ] }, "user": { "id": 714067 } } ]

Extract data

records = [] for entry in data: status = entry['status'] user_id = entry['user']['id'] development_plan_id = entry['development_plan']['id'] for item in entry['development_plan']['items']: record = { 'id': item['id'], 'item_name': item['item_name'], 'development_plan_id': development_plan_id, 'status': status, 'user_id': user_id } records.append(record)

Create DataFrame

df = pd.DataFrame(records)

print(df)

I got the following result: id item_name development_plan_id status user_id 0 176498 itemname 1 23858 on_track 714067 1 172450 itemname 2 23858 on_track 714067 2 172451 itemname 3 23858 on_track 714067

phan480 commented 5 months ago

Thank you for your response, yes it works. But doesn't use the json_normalize functionality? I am using this function to parse a multitude of api's and their responses. I configure their record_path and meta data columns in configuration data.

I would like to avoid creating a unique way to parse every API.

yuanx749 commented 1 month ago

I believe there is a bug, and seems this pr #59225 can fix it.