pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`read_json` should be independent of json attribute order #14415

Open tommyhe6 opened 9 months ago

tommyhe6 commented 9 months ago

Reproducible example

import io

import polars as pl

j1 = """
{
  "columns": [
    {
      "name": "name",
      "datatype": "String",
      "bit_settings": "",
      "values": [
        "Alice"
      ]
    }
  ]
}
"""

j2 = """
{
  "columns": [
    {
      "name": "name",
      "bit_settings": "",
      "values": [
        "Alice"
      ],
      "datatype": "String"
    }
  ]
}
"""

df1 = pl.read_json(io.StringIO(j1))
df2 = pl.read_json(io.StringIO(j2))
print(df1, df2)
print(df1.equals(df2))

Log output

shape: (1, 1)
┌───────┐
│ name  │
│ ---   │
│ str   │
╞═══════╡
│ Alice │
└───────┘ shape: (1, 1)
┌──────────────────────────────────┐
│ columns                          │
│ ---                              │
│ list[struct[4]]                  │
╞══════════════════════════════════╡
│ [{"name","",["Alice"],"String"}] │
└──────────────────────────────────┘
False

Issue description

`read_json` behavior should be independent of the ordering of attributes, as it is under the most common definition of JSON. Notably, this caused problems when serializing a DataFrame to JSON, writing the JSON to a database (Postgres in my case), then reading it back into a Polars DataFrame.

Expected behavior

Both DataFrames should be the same, and specifically both should match the first.

Installed versions

```
--------Version info---------
Polars:               0.20.7
Index type:           UInt32
Platform:             macOS-13.5.2-arm64-arm-64bit
Python:               3.11.6 (main, Oct 2 2023, 13:45:54) [Clang 16.0.6 ]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
hvplot:
matplotlib:
numpy:                1.26.4
openpyxl:
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```

ritchie46 commented 9 months ago

I think we should add a keyword argument that tells `read_json` to order the fields alphabetically, because for our structs the order does matter.

taki-mekhalfa commented 9 months ago

> I think we should add a keyword argument that tells `read_json` to order the fields alphabetically, because for our structs the order does matter.

Other tools can change the key order too, as Postgres did in the OP's case.
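Until something like that keyword exists, one possible workaround is to restore the key order before calling `read_json`. The sketch below is only that: `CANONICAL_ORDER` and `reorder_columns` are hypothetical names, and the key order is simply copied from `j1` in the report, on the assumption that this is the order Polars 0.20.7 recognizes as a serialized DataFrame. It does not handle nested columns.

```python
import json

# Key order copied from j1 in the report, which parsed as expected.
# Assumption: this is the order read_json recognizes in Polars 0.20.7.
CANONICAL_ORDER = ("name", "datatype", "bit_settings", "values")


def reorder_columns(text: str) -> str:
    """Return the JSON text with each column object's keys in CANONICAL_ORDER."""
    doc = json.loads(text)
    reordered = []
    for col in doc["columns"]:
        # Known keys first, in canonical order; unknown keys keep their relative order.
        fixed = {k: col[k] for k in CANONICAL_ORDER if k in col}
        fixed.update({k: v for k, v in col.items() if k not in CANONICAL_ORDER})
        reordered.append(fixed)
    doc["columns"] = reordered
    return json.dumps(doc)


# Same content as j2 in the report, with "datatype" last.
shuffled = '{"columns": [{"name": "name", "bit_settings": "", "values": ["Alice"], "datatype": "String"}]}'
fixed_text = reorder_columns(shuffled)
fixed_keys = list(json.loads(fixed_text)["columns"][0])
```

The resulting string could then be wrapped in `io.StringIO` and passed to `pl.read_json` as in the reproducible example.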

mcrumiller commented 9 months ago

The fundamental issue is that two JSON documents being equal does not mean their DataFrame representations are equal. This is true whenever one format carries more information than the other. Polars DataFrames contain information about the order of their columns, whereas JSON objects do not. See this on the main page of json.org:

> An object is an unordered set of name/value pairs.

One way to solve this problem is to always ensure the columns from a JSON file are represented by Polars in alphabetical order, as @ritchie46 suggested. This may confuse people with simple JSON files when they find their schema has been reordered (if we default it to `True`, as I think perhaps we should), but a quick sentence in the API docs would explain that.
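To illustrate the idea (not how Polars would implement it), here is a minimal standard-library sketch in which two objects that differ only in key order end up with the same field order once their keys are sorted. `sort_keys_recursively` is a hypothetical helper name.

```python
import json


def sort_keys_recursively(value):
    """Return a copy of a parsed JSON value with every object's keys sorted."""
    if isinstance(value, dict):
        return {k: sort_keys_recursively(value[k]) for k in sorted(value)}
    if isinstance(value, list):
        return [sort_keys_recursively(v) for v in value]
    return value


a = json.loads('{"name": "Alice", "age": 30}')
b = json.loads('{"age": 30, "name": "Alice"}')

# Before sorting, the key order (and so the would-be field order) differs.
order_differs = list(a) != list(b)

# After sorting, both objects present their keys in the same order.
sorted_a = sort_keys_recursively(a)
sorted_b = sort_keys_recursively(b)
```

The trade-off mentioned above is visible here: `a` started with `name` first, but after sorting `age` comes first.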