pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
27.62k stars 1.69k forks source link

Normalize JSON #7078

Open romanzdk opened 1 year ago

romanzdk commented 1 year ago

Problem description

It would be perfect to have similar functionality to pandas json_normalize - https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html

cmdlineluser commented 1 year ago

Incase it's useful information: the last 2 examples already "work" with .read_json()

import io
import polars as pl

df = pl.read_json(io.BytesIO(b"""
[
    {
        "id": 1,
        "name": "Cole Volk",
        "fitness": {"height": 130, "weight": 60}
    },
    {"name": "Mark Reg", "fitness": {"height": 130, "weight": 60}},
    {
        "id": 2,
        "name": "Faye Raker",
        "fitness": {"height": 130, "weight": 60}
    }
]
"""))
shape: (3, 3)
┌──────┬────────────┬───────────┐
│ id   | name       | fitness   │
│ ---  | ---        | ---       │
│ i64  | str        | struct[2] │
╞══════╪════════════╪═══════════╡
│ 1    | Cole Volk  | {130,60}  │
│ null | Mark Reg   | {130,60}  │
│ 2    | Faye Raker | {130,60}  │
└──────┴────────────┴───────────┘

You can then .unnest() the fitness column:

>>> df.unnest("fitness")
shape: (3, 4)
┌──────┬────────────┬────────┬────────┐
│ id   | name       | height | weight │
│ ---  | ---        | ---    | ---    │
│ i64  | str        | i64    | i64    │
╞══════╪════════════╪════════╪════════╡
│ 1    | Cole Volk  | 130    | 60     │
│ null | Mark Reg   | 130    | 60     │
│ 2    | Faye Raker | 130    | 60     │
└──────┴────────────┴────────┴────────┘
df = pl.read_json(io.BytesIO(b"""
[
    {
        "state": "Florida",
        "shortname": "FL",
        "info": {"governor": "Rick Scott"},
        "counties": [
            {"name": "Dade", "population": 12345},
            {"name": "Broward", "population": 40000},
            {"name": "Palm Beach", "population": 60000}
        ]
    },
    {
        "state": "Ohio",
        "shortname": "OH",
        "info": {"governor": "John Kasich"},
        "counties": [
            {"name": "Summit", "population": 1234},
            {"name": "Cuyahoga", "population": 1337}
        ]
    }
]
"""))
shape: (2, 4)
┌─────────┬───────────┬─────────────────┬─────────────────────────────────────┐
│ state   | shortname | info            | counties                            │
│ ---     | ---       | ---             | ---                                 │
│ str     | str       | struct[1]       | list[struct[2]]                     │
╞═════════╪═══════════╪═════════════════╪═════════════════════════════════════╡
│ Florida | FL        | {"Rick Scott"}  | [{"Dade",12345}, {"Broward",4000... │
│ Ohio    | OH        | {"John Kasich"} | [{"Summit",1234}, {"Cuyahoga",13... │
└─────────┴───────────┴─────────────────┴─────────────────────────────────────┘

.unnest() and .explode()

>>> df.unnest("info").explode("counties").unnest("counties")
shape: (5, 5)
┌─────────┬───────────┬─────────────┬────────────┬────────────┐
│ state   | shortname | governor    | name       | population │
│ ---     | ---       | ---         | ---        | ---        │
│ str     | str       | str         | str        | i64        │
╞═════════╪═══════════╪═════════════╪════════════╪════════════╡
│ Florida | FL        | Rick Scott  | Dade       | 12345      │
│ Florida | FL        | Rick Scott  | Broward    | 40000      │
│ Florida | FL        | Rick Scott  | Palm Beach | 60000      │
│ Ohio    | OH        | John Kasich | Summit     | 1234       │
│ Ohio    | OH        | John Kasich | Cuyahoga   | 1337       │
└─────────┴───────────┴─────────────┴────────────┴────────────┘

Seems like it could be useful if .unnest() had the option to prefix the resulting column names to avoid name clashes.

amzar96 commented 1 month ago

DuplicateError: column with name 'name' has more than one occurrences Agree with @cmdlineluser . Should have option to add custom prefix as well.

cmdlineluser commented 1 month ago

@amzar96 .name.map_fields() has since been added which can be used before unnesting.