`read_json` should be independent of json attribute order

tommyhe6 commented 9 months ago

Checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import io

import polars as pl

j1 = """
{
  "columns": [
    {
      "name": "name",
      "datatype": "String",
      "bit_settings": "",
      "values": [
        "Alice"
      ]
    }
  ]
}
"""

j2 = """
{
  "columns": [
    {
      "name": "name",
      "bit_settings": "",
      "values": [
        "Alice"
      ],
      "datatype": "String"
    }
  ]
}
"""

df1 = pl.read_json(io.StringIO(j1))
df2 = pl.read_json(io.StringIO(j2))
print(df1, df2)
print(df1.equals(df2))

Log output

shape: (1, 1)
┌───────┐
│ name  │
│ ---   │
│ str   │
╞═══════╡
│ Alice │
└───────┘ shape: (1, 1)
┌──────────────────────────────────┐
│ columns                          │
│ ---                              │
│ list[struct[4]]                  │
╞══════════════════════════════════╡
│ [{"name","",["Alice"],"String"}] │
└──────────────────────────────────┘
False

Issue description

`read_json` behavior should be independent of the ordering of the attributes, under the most common definition of JSON. Notably, this caused problems when serializing a DataFrame to JSON, writing the JSON to a database (Postgres in my case), then reading it back into a Polars DataFrame.
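To illustrate the point with the standard library only (no Polars involved), here is a minimal sketch that parses compact versions of `j1` and `j2` from the example above. The variable names `same_content`, `order1` and `order2` are just for illustration.

```python
import json

# Compact versions of j1 and j2 from the reproducible example
j1 = '{"columns": [{"name": "name", "datatype": "String", "bit_settings": "", "values": ["Alice"]}]}'
j2 = '{"columns": [{"name": "name", "bit_settings": "", "values": ["Alice"], "datatype": "String"}]}'

d1 = json.loads(j1)
d2 = json.loads(j2)

# dict equality ignores key order, so the two documents compare equal
same_content = d1 == d2

# each parsed dict still remembers the key order as written in the text
order1 = list(d1["columns"][0])
order2 = list(d2["columns"][0])

print(same_content)
print(order1)
print(order2)
```

The two documents are equal as parsed objects and differ only in the key order, which is the one thing that changed between `df1` and `df2`.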

Expected behavior

Both DataFrames should be the same, specifically matching the first.

Installed versions

```
--------Version info---------
Polars: 0.20.7
Index type: UInt32
Platform: macOS-13.5.2-arm64-arm-64bit
Python: 3.11.6 (main, Oct 2 2023, 13:45:54) [Clang 16.0.6 ]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
hvplot:
matplotlib:
numpy: 1.26.4
openpyxl:
pandas: 2.2.0
pyarrow: 15.0.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```

ritchie46 commented 9 months ago

I think we should allow for a keyword argument that orders the fields alphabetically, as for our structs order does matter.

taki-mekhalfa commented 9 months ago

> I think we should allow for a keyword argument that orders the fields alphabetically, as for our structs order does matter.

Other tools can mess up the order too, like Postgres, as the OP mentioned.

mcrumiller commented 9 months ago

The fundamental issue is that just because two JSON documents are equal, that does not mean that their DataFrame representations are equal. This is true in any case in which one framework contains more information than the other. Polars DataFrames contain information about the order of their columns. See this on the main page of json.org:

> An object is an unordered set of name/value pairs.

One way to solve this problem is to always ensure the columns from a JSON file are represented by Polars in alphabetical order, as @ritchie46 suggested. This may confuse people with simple JSON files when they find their schema has been reordered (if we default it to True, as I think perhaps we should), but a quick sentence in the API docs would explain that.
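As a rough sketch of what alphabetical ordering would mean for the two records in the original example, the snippet below sorts keys on the Python side before any reader sees them. `sort_keys_recursively` is a hypothetical helper written for this comment, not a Polars function, and nothing here implies that `read_json` has such a keyword argument.

```python
import json


def sort_keys_recursively(obj):
    """Return a copy of obj with the keys of every dict in alphabetical order."""
    if isinstance(obj, dict):
        return {key: sort_keys_recursively(obj[key]) for key in sorted(obj)}
    if isinstance(obj, list):
        return [sort_keys_recursively(item) for item in obj]
    return obj


# The two records from the original example, differing only in key order
rec1 = {"name": "name", "datatype": "String", "bit_settings": "", "values": ["Alice"]}
rec2 = {"name": "name", "bit_settings": "", "values": ["Alice"], "datatype": "String"}

sorted1 = sort_keys_recursively(rec1)
sorted2 = sort_keys_recursively(rec2)

# After sorting, both records have the same field order and the same text form
print(list(sorted1))
print(json.dumps(sorted1) == json.dumps(sorted2))
```

With a fixed ordering like this, two equal JSON objects always yield the same field order, at the cost of discarding whatever order the author of the file wrote.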
