pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Additional Parameter for json_normalize #17672

Open jrasband-dev opened 1 month ago

jrasband-dev commented 1 month ago

Description

Request

It would be great if we could select a path in the JSON file to start parsing records at. pandas has this functionality via the record_path argument of json_normalize (see below).
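
For illustration, a hypothetical call with the proposed parameter might look like this (record_path is the requested parameter and does not exist in pl.json_normalize today; the "data" path refers to the example dataset used below):

import json
import polars as pl

with open("../../Datasets/Electric_Vehicles.json") as f:
    data = json.load(f)

# Hypothetical: start normalizing at data["data"] instead of the document root.
df = pl.json_normalize(data, record_path="data")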

Usage

data source: https://catalog.data.gov/dataset/electric-vehicle-population-data

Using Polars

import json
import polars as pl

with open("../../Datasets/Electric_Vehicles.json") as f:
    data = json.load(f)

df = pl.json_normalize(data, strict=False)
df

OUTPUT

┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ data      ┆ meta.view ┆ meta.view ┆ meta.view ┆ … ┆ meta.view ┆ meta.view ┆ meta.view ┆ meta.vie │
│ ---       ┆ .id       ┆ .name     ┆ .assetTyp ┆   ┆ .tableAut ┆ .tableAut ┆ .tags     ┆ w.flags  │
│ list[list ┆ ---       ┆ ---       ┆ e         ┆   ┆ hor.type  ┆ hor.flags ┆ ---       ┆ ---      │
│ [str]]    ┆ str       ┆ str       ┆ ---       ┆   ┆ ---       ┆ ---       ┆ list[str] ┆ list[str │
│           ┆           ┆           ┆ str       ┆   ┆ str       ┆ list[str] ┆           ┆ ]        │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ [["row-d7 ┆ f6w7-q2d2 ┆ Electric  ┆ dataset   ┆ … ┆ interacti ┆ ["accepte ┆ ["tesla", ┆ ["defaul │
│ 7h.fy7m.e ┆           ┆ Vehicle   ┆           ┆   ┆ ve        ┆ dEula",   ┆ "leaf", … ┆ t", "own │
│ y66",     ┆           ┆ Populatio ┆           ┆   ┆           ┆ "mayBeSto ┆ "rao_veh" ┆ erMayBeC │
│ "00000…   ┆           ┆ n Da…     ┆           ┆   ┆           ┆ ries…     ┆ ]         ┆ ontact…  │
└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘

Using Pandas with record path argument

import json
import pandas as pd

with open("../../Datasets/Electric_Vehicles.json") as f:
    data = json.load(f)

df = pd.json_normalize(data, record_path='data')
df

OUTPUT

                        0                                     1   2    
0       row-d77h.fy7m.ey66  00000000-0000-0000-4C16-DAD83D16BE64   0  \
1       row-95h2~76n9~t6c9  00000000-0000-0000-CB85-161B4B6EBEE5   0   
2       row-7cud~im3p_9g2m  00000000-0000-0000-ED53-19A72CAA2B6D   0   
3       row-3ey5-tfyy~hruc  00000000-0000-0000-D6A7-A287F876E567   0   
4       row-h6ea_9nav-rcvr  00000000-0000-0000-8651-BE114495116F   0   
...                    ...                                   ...  ..   
191402  row-j544~qpj7-mq87  00000000-0000-0000-13D5-05626634EDE6   0   
191403  row-ertp~645x-t9pj  00000000-0000-0000-5768-68B8783B0107   0   
191404  row-5wyy~zwx8-2r9a  00000000-0000-0000-FA3A-2DAC2F13B1ED   0   
191405  row-mevd_k52q~9z9m  00000000-0000-0000-4F11-698F6129FFAE   0   
191406  row-6n89~ue2i-wgfr  00000000-0000-0000-C373-B6A30CFF220D   0   

                3     4           5     6    7           8             9    
0       1718719976  None  1718720088  None  { }  5YJSA1E22K          King  \
1       1718719976  None  1718720088  None  { }  3MW39FS05R        Yakima   
2       1718719976  None  1718720088  None  { }  1N4AZ0CP0F          King   
3       1718719976  None  1718720088  None  { }  5YJSA1H20F     Snohomish   
4       1718719976  None  1718720088  None  { }  JTMAB3FV1N        Yakima   
...            ...   ...         ...   ...  ...         ...           ...   
191402  1718719976  None  1718720128  None  { }  5YJ3E1EA0K        Pierce   
191403  1718719976  None  1718720128  None  { }  JN1BF0BA7P        Pierce   
191404  1718719976  None  1718720128  None  { }  7SAYGDEE4R          King   
191405  1718719976  None  1718720128  None  { }  7SAXCBE65N  Grays Harbor   
191406  1718719976  None  1718720128  None  { }  7SAYGAEEXP        Benton   
...
191404   8   6  
191405   6  26  
191406   4  12

Pandas with nested paths

import json
import pandas as pd

with open("../../Datasets/Electric_Vehicles.json") as f:
    data = json.load(f)

df = pd.json_normalize(data, record_path=['meta', 'view', 'columns'])
df

OUTPUT

0          -1                                                sid    meta_data  \
1          -1                                                 id    meta_data   
2          -1                                           position    meta_data   
3          -1                                         created_at    meta_data   
4          -1                                       created_meta    meta_data   
5          -1                                         updated_at    meta_data   
6          -1                                       updated_meta    meta_data   
7          -1                                               meta    meta_data   
8   583288589                                         VIN (1-10)         text   
9   583288590                                             County         text   
10  583288591                                               City         text   
11  583288592                                              State         text   
12  583288593                                        Postal Code         text   
13  583288594                                         Model Year         text   
14  583288595                                               Make         text   
15  583288596                                              Model         text   
16  583288597                              Electric Vehicle Type         text   
17  583288598  Clean Alternative Fuel Vehicle (CAFV) Eligibility         text   
18  583288599                                     Electric Range       number   
19  583288600                                          Base MSRP       number   
20  583288610                               Legislative District         text   
21  583288602                                     DOL Vehicle ID         text   
22  583288603                                   Vehicle Location        point   
23  583288605                                   Electric Utility         text  
jrasband-dev commented 1 month ago

@ritchie46 Super excited for the new json_normalize function in polars 1.0.0! Would love to hear your thoughts on this as a further development item.

For additional context, a little over a year ago I put together an in-depth video on how the pandas json_normalize function works (something you don't always get from just reading the documentation). https://www.youtube.com/watch?v=5JyLCsQ2JNc

cmdlineluser commented 1 month ago

~Just on the topic of that specific dataset: I think Polars may have an issue with the first example as the lists are of mixed types.~

~You may need to manually transpose them (e.g. with zip):~

>>> pl.DataFrame(zip(*data["data"]), orient="col")
shape: (191_407, 28)
┌───────────────┬───────────────┬──────────┬────────────┬───┬─────────────┬───────────┬───────────┬───────────┐
│ column_0      ┆ column_1      ┆ column_2 ┆ column_3   ┆ … ┆ column_24   ┆ column_25 ┆ column_26 ┆ column_27 │
│ ---           ┆ ---           ┆ ---      ┆ ---        ┆   ┆ ---         ┆ ---       ┆ ---       ┆ ---       │
│ str           ┆ str           ┆ i64      ┆ i64        ┆   ┆ str         ┆ str       ┆ str       ┆ str       │
╞═══════════════╪═══════════════╪══════════╪════════════╪═══╪═════════════╪═══════════╪═══════════╪═══════════╡
│ row-d77h.fy7… ┆ 00000000-000… ┆ 0        ┆ 1718719976 ┆ … ┆ 53033006500 ┆ 3009      ┆ 7         ┆ 4         │
│ row-95h2~76n… ┆ 00000000-000… ┆ 0        ┆ 1718719976 ┆ … ┆ 53077002201 ┆ 2746      ┆ 4         ┆ 22        │
│ row-7cud~im3… ┆ 00000000-000… ┆ 0        ┆ 1718719976 ┆ … ┆ 53033029306 ┆ 3009      ┆ 9         ┆ 46        │
│ row-3ey5-tfy… ┆ 00000000-000… ┆ 0        ┆ 1718719976 ┆ … ┆ 53061052107 ┆ 3213      ┆ 1         ┆ 45        │
│ row-h6ea_9na… ┆ 00000000-000… ┆ 0        ┆ 1718719976 ┆ … ┆ 53077000902 ┆ 2746      ┆ 4         ┆ 20        │
│ …             ┆ …             ┆ …        ┆ …          ┆ … ┆ …           ┆ …         ┆ …         ┆ …         │
│ row-j544~qpj… ┆ 00000000-000… ┆ 0        ┆ 1718719976 ┆ … ┆ 53053073302 ┆ 3210      ┆ 10        ┆ 15        │
│ row-ertp~645… ┆ 00000000-000… ┆ 0        ┆ 1718719976 ┆ … ┆ 53053070206 ┆ 3210      ┆ 8         ┆ 15        │
│ row-5wyy~zwx… ┆ 00000000-000… ┆ 0        ┆ 1718719976 ┆ … ┆ 53033032601 ┆ 3009      ┆ 8         ┆ 6         │
│ row-mevd_k52… ┆ 00000000-000… ┆ 0        ┆ 1718719976 ┆ … ┆ 53027001100 ┆ 2983      ┆ 6         ┆ 26        │
│ row-6n89~ue2… ┆ 00000000-000… ┆ 0        ┆ 1718719976 ┆ … ┆ 53005010811 ┆ 2955      ┆ 4         ┆ 12        │
└───────────────┴───────────────┴──────────┴────────────┴───┴─────────────┴───────────┴───────────┴───────────┘

Does record_path do something differently compared to indexing the object directly?

>>> pl.json_normalize(data["meta"]["view"]["columns"]).columns
['id',
 'name',
 'dataTypeName',
 'fieldName',
 'position',
 'renderTypeName',
 'flags',
 'description',
 'tableColumnId',
 'format.align',
 'computationStrategy.source_columns',
 'computationStrategy.type',
 'computationStrategy.parameters.region',
 'computationStrategy.parameters.primary_key']
jrasband-dev commented 1 month ago

I didn't think of indexing the object directly. That definitely works when getting the column names.

I suppose the one advantage of the record_path argument in pandas is that when you specify the "data" key, it will correctly parse and read the data into a DataFrame, whereas Polars requires the zip workaround you mentioned, and indexing the object directly doesn't work:

import json
import polars as pl

with open("../../Datasets/Electric_Vehicles.json") as f:
    data = json.load(f)

df = pl.json_normalize(data["data"])
df

Output

shape: (28, 191_407)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ column_0  ┆ column_1  ┆ column_2  ┆ column_3  ┆ … ┆ column_19 ┆ column_19 ┆ column_19 ┆ column_1 │
│ ---       ┆ ---       ┆ ---       ┆ ---       ┆   ┆ 1403      ┆ 1404      ┆ 1405      ┆ 91406    │
│ struct[1] ┆ struct[1] ┆ struct[1] ┆ struct[1] ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│           ┆           ┆           ┆           ┆   ┆ struct[1] ┆ struct[1] ┆ struct[1] ┆ struct[1 │
│           ┆           ┆           ┆           ┆   ┆           ┆           ┆           ┆ ]        │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ {null}    ┆ {null}    ┆ {null}    ┆ {null}    ┆ … ┆ {null}    ┆ {null}    ┆ {null}    ┆ {null}   │
│ {null}    ┆ {null}    ┆ {null}    ┆ {null}    ┆ … ┆ {null}    ┆ {null}    ┆ {null}    ┆ {null}   │
│ {null}    ┆ {null}    ┆ {null}    ┆ {null}    ┆ … ┆ {null}    ┆ {null}    ┆ {null}    ┆ {null}   │
│ {null}    ┆ {null}    ┆ {null}    ┆ {null}    ┆ … ┆ {null}    ┆ {null}    ┆ {null}    ┆ {null}   │
│ {null}    ┆ {null}    ┆ {null}    ┆ {null}    ┆ … ┆ {null}    ┆ {null}    ┆ {null}    ┆ {null}   │
│ …         ┆ …         ┆ …         ┆ …         ┆ … ┆ …         ┆ …         ┆ …         ┆ …        │
│ {null}    ┆ {null}    ┆ {null}    ┆ {null}    ┆ … ┆ {null}    ┆ {null}    ┆ {null}    ┆ {null}   │
│ {null}    ┆ {null}    ┆ {null}    ┆ {null}    ┆ … ┆ {null}    ┆ {null}    ┆ {null}    ┆ {null}   │
│ {null}    ┆ {null}    ┆ {null}    ┆ {null}    ┆ … ┆ {null}    ┆ {null}    ┆ {null}    ┆ {null}   │
│ {null}    ┆ {null}    ┆ {null}    ┆ {null}    ┆ … ┆ {null}    ┆ {null}    ┆ {null}    ┆ {null}   │
│ {null}    ┆ {null}    ┆ {null}    ┆ {null}    ┆ … ┆ {null}    ┆ {null}    ┆ {null}    ┆ {null}   │
└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘

And using the zip function certainly isn't hard to do. Now that you've shared that code snippet it makes a lot of sense, but perhaps it's more of an ease-of-use situation for users. I don't know that I would have known how to read the data in correctly if you hadn't shared that with me.

cmdlineluser commented 1 month ago

Ah.. I see.

So without it, it behaves similarly to the Polars example:

>>> pd.json_normalize(data["data"])
        0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27
0       {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}
1       {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}  {}
[191407 rows x 28 columns]

I was being silly with the zip:

>>> pl.DataFrame(data["data"], orient="row")
shape: (191_407, 28)
┌───────────────┬───────────────┬──────────┬────────────┬───┬─────────────┬───────────┬───────────┬───────────┐
│ column_0      ┆ column_1      ┆ column_2 ┆ column_3   ┆ … ┆ column_24   ┆ column_25 ┆ column_26 ┆ column_27 │
│ ---           ┆ ---           ┆ ---      ┆ ---        ┆   ┆ ---         ┆ ---       ┆ ---       ┆ ---       │
│ str           ┆ str           ┆ i64      ┆ i64        ┆   ┆ str         ┆ str       ┆ str       ┆ str       │
╞═══════════════╪═══════════════╪══════════╪════════════╪═══╪═════════════╪═══════════╪═══════════╪═══════════╡
│ row-d77h.fy7… ┆ 00000000-000… ┆ 0        ┆ 1718719976 ┆ … ┆ 53033006500 ┆ 3009      ┆ 7         ┆ 4         │
│ row-95h2~76n… ┆ 00000000-000… ┆ 0        ┆ 1718719976 ┆ … ┆ 53077002201 ┆ 2746      ┆ 4         ┆ 22        │
└───────────────┴───────────────┴──────────┴────────────┴───┴─────────────┴───────────┴───────────┴───────────┘

So it seems record_path may also enforce what Polars calls orient="row".
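
A minimal synthetic illustration of that equivalence (toy data, not the dataset above):

import polars as pl

rows = [["a", 1], ["b", 2]]  # row-wise records with mixed types per inner list

pl.DataFrame(rows, orient="row")              # each inner list becomes a row
pl.DataFrame(list(zip(*rows)), orient="col")  # the same frame via the manual transpose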

jrasband-dev commented 1 month ago

Exactly. So I guess the question would be: is it worth building out a new parameter for json_normalize?

It looks like it works fine when you index the object directly, and when you have mixed types you can use orient="row". That seems to accomplish the same goal.
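
If the parameter were added, a rough user-side sketch of the requested behaviour might look like the following (normalize_at_path is a hypothetical helper, not a Polars API; it assumes the records at the path are either a list of dicts or a list of lists):

import polars as pl

def normalize_at_path(data, record_path):
    # Hypothetical helper: walk down the nested path, then normalize the records.
    keys = record_path if isinstance(record_path, list) else [record_path]
    records = data
    for key in keys:
        records = records[key]
    if records and isinstance(records[0], dict):
        # A list of objects: the existing json_normalize already handles this.
        return pl.json_normalize(records)
    # A list of lists: build a row-oriented DataFrame, as discussed above.
    return pl.DataFrame(records, orient="row")

# normalize_at_path(data, "data")
# normalize_at_path(data, ["meta", "view", "columns"])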