Open romanzdk opened 1 year ago
Incase it's useful information: the last 2 examples already "work" with .read_json()
import io
import polars as pl
df = pl.read_json(io.BytesIO(b"""
[
{
"id": 1,
"name": "Cole Volk",
"fitness": {"height": 130, "weight": 60}
},
{"name": "Mark Reg", "fitness": {"height": 130, "weight": 60}},
{
"id": 2,
"name": "Faye Raker",
"fitness": {"height": 130, "weight": 60}
}
]
"""))
shape: (3, 3)
┌──────┬────────────┬───────────┐
│ id | name | fitness │
│ --- | --- | --- │
│ i64 | str | struct[2] │
╞══════╪════════════╪═══════════╡
│ 1 | Cole Volk | {130,60} │
│ null | Mark Reg | {130,60} │
│ 2 | Faye Raker | {130,60} │
└──────┴────────────┴───────────┘
You can then .unnest()
the fitness
column:
>>> df.unnest("fitness")
shape: (3, 4)
┌──────┬────────────┬────────┬────────┐
│ id | name | height | weight │
│ --- | --- | --- | --- │
│ i64 | str | i64 | i64 │
╞══════╪════════════╪════════╪════════╡
│ 1 | Cole Volk | 130 | 60 │
│ null | Mark Reg | 130 | 60 │
│ 2 | Faye Raker | 130 | 60 │
└──────┴────────────┴────────┴────────┘
df = pl.read_json(io.BytesIO(b"""
[
{
"state": "Florida",
"shortname": "FL",
"info": {"governor": "Rick Scott"},
"counties": [
{"name": "Dade", "population": 12345},
{"name": "Broward", "population": 40000},
{"name": "Palm Beach", "population": 60000}
]
},
{
"state": "Ohio",
"shortname": "OH",
"info": {"governor": "John Kasich"},
"counties": [
{"name": "Summit", "population": 1234},
{"name": "Cuyahoga", "population": 1337}
]
}
]
"""))
shape: (2, 4)
┌─────────┬───────────┬─────────────────┬─────────────────────────────────────┐
│ state | shortname | info | counties │
│ --- | --- | --- | --- │
│ str | str | struct[1] | list[struct[2]] │
╞═════════╪═══════════╪═════════════════╪═════════════════════════════════════╡
│ Florida | FL | {"Rick Scott"} | [{"Dade",12345}, {"Broward",4000... │
│ Ohio | OH | {"John Kasich"} | [{"Summit",1234}, {"Cuyahoga",13... │
└─────────┴───────────┴─────────────────┴─────────────────────────────────────┘
.unnest()
and .explode()
>>> df.unnest("info").explode("counties").unnest("counties")
shape: (5, 5)
┌─────────┬───────────┬─────────────┬────────────┬────────────┐
│ state | shortname | governor | name | population │
│ --- | --- | --- | --- | --- │
│ str | str | str | str | i64 │
╞═════════╪═══════════╪═════════════╪════════════╪════════════╡
│ Florida | FL | Rick Scott | Dade | 12345 │
│ Florida | FL | Rick Scott | Broward | 40000 │
│ Florida | FL | Rick Scott | Palm Beach | 60000 │
│ Ohio | OH | John Kasich | Summit | 1234 │
│ Ohio | OH | John Kasich | Cuyahoga | 1337 │
└─────────┴───────────┴─────────────┴────────────┴────────────┘
Seems like it could be useful if .unnest()
had the option to prefix the resulting column names to avoid name clashes.
DuplicateError: column with name 'name' has more than one occurrences
Agree with @cmdlineluser . Should have option to add custom prefix as well.
@amzar96 .name.map_fields()
has since been added which can be used before unnesting.
Problem description
It would be perfect to have similar functionality to pandas
json_normalize
- https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html