pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

read_json unable to parse json list of lists, and doesn't respect documented schema arguments #16911

Open NSchrading opened 3 months ago

NSchrading commented 3 months ago

Reproducible example


import polars as pl
from io import StringIO

example_json = """
[
["2024-05-15T23:59:00Z",-0.512,null,null,null,"a"],
["2024-05-15T23:59:01Z",-0.5,null,null,null,"a"],
["2024-05-15T23:59:00Z",0.4,null,null,null,"b"]
]
"""

pl.read_json(StringIO(example_json))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\NSchr\Programming\temp\venv\lib\site-packages\polars\io\json.py", line 94, in read_json
    pydf = PyDataFrame.read_json(
RuntimeError: BindingsError: "can only deserialize json objects"

# change value array, make one of the other fields non-null, same error
example_json = """
[
["2024-05-15T23:59:00Z",-0.512,null,null,null,"a"],
["2024-05-15T23:59:01Z",-0.5,null,null,null,"a"],
["2024-05-15T23:59:00Z",null,true,null,null,"b"]
]
"""
pl.read_json(StringIO(example_json))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\NSchr\Programming\temp\venv\lib\site-packages\polars\io\json.py", line 94, in read_json
    pydf = PyDataFrame.read_json(
RuntimeError: BindingsError: "can only deserialize json objects"

# remove nulls. same error
example_json = """
[
["2024-05-15T23:59:00Z",-0.512,"a"],
["2024-05-15T23:59:01Z",-0.5,"a"],
["2024-05-15T23:59:00Z",true,"b"]
]
"""
pl.read_json(StringIO(example_json))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\NSchr\Programming\temp\venv\lib\site-packages\polars\io\json.py", line 94, in read_json
    pydf = PyDataFrame.read_json(
RuntimeError: BindingsError: "can only deserialize json objects"

# try with a schema for the column names
# fails despite documentation saying you can pass a list of column names
pl.read_json(StringIO(example_json), schema=["col1", "col2", "col3"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\NSchr\Programming\temp\venv\lib\site-packages\polars\io\json.py", line 94, in read_json
    pydf = PyDataFrame.read_json(
TypeError: argument 'schema': 'list' object cannot be converted to 'PyDict'

# try with ndjson
example_json = """
["2024-05-15T23:59:00Z",-0.512,null,null,null,"a"]
["2024-05-15T23:59:01Z",-0.5,null,null,null,"a"]
["2024-05-15T23:59:00Z",null,true,null,null,"b"]
"""

pl.read_ndjson(StringIO(example_json))
"""
thread '<unnamed>' panicked at crates\polars-arrow\src\array\struct_\mod.rs:239:41:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("Struct array must be created with a DataType whose physical type is Struct"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\NSchr\Programming\temp\venv\lib\site-packages\polars\io\ndjson.py", line 77, in read_ndjson
    pydf = PyDataFrame.read_ndjson(
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("Struct array must be created with a DataType whose physical type is Struct"))

Log output

No response

Issue description

Polars is unable to parse a valid JSON list of lists; see the reproducible examples above. It also rejects a schema passed as a list of column names, even though the documentation (https://docs.pola.rs/api/python/stable/reference/api/polars.read_json.html) says this is supported:

As a list of column names; in this case types are automatically inferred.

Expected behavior

Expected behavior is something like pandas:


import pandas as pd
from io import StringIO

example_json = """
[
["2024-05-15T23:59:00Z",-0.512,null,null,null,"a"],
["2024-05-15T23:59:01Z",-0.5,null,null,null,"a"],
["2024-05-15T23:59:00Z",0.4,null,null,null,"b"]
]
"""

pd.read_json(StringIO(example_json))
                      0      1   2   3   4  5
0  2024-05-15T23:59:00Z -0.512 NaN NaN NaN  a
1  2024-05-15T23:59:01Z -0.500 NaN NaN NaN  a
2  2024-05-15T23:59:00Z  0.400 NaN NaN NaN  b

If a schema with column names were provided, the columns would be named accordingly instead of 0-5.

Installed versions

```
>>> pl.show_versions()
--------Version info---------
Polars:              0.20.31
Index type:          UInt32
Platform:            Windows-10-10.0.19045-SP0
Python:              3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
hvplot:
matplotlib:
nest_asyncio:
numpy:               1.24.2
openpyxl:
pandas:              2.0.0
pyarrow:
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```
stinodego commented 3 months ago

I don't think that's valid JSON... 🤔 The error messages definitely aren't great here, though.

NSchrading commented 3 months ago

I wouldn't expect Python's built-in json parser or pandas to parse it if it weren't valid JSON:

>>> example_json = """
... [
... ["2024-05-15T23:59:00Z",-0.512,null,null,null,"a"],
... ["2024-05-15T23:59:01Z",-0.5,null,null,null,"a"],
... ["2024-05-15T23:59:00Z",0.4,null,null,null,"b"]
... ]
... """
>>>
>>> json.loads(example_json)
[['2024-05-15T23:59:00Z', -0.512, None, None, None, 'a'], ['2024-05-15T23:59:01Z', -0.5, None, None, None, 'a'], ['2024-05-15T23:59:00Z', 0.4, None, None, None, 'b']]

It also passes online JSON lint/validators, e.g. https://jsonlint.com/. I believe these are valid JSON arrays according to the JSON spec:

JSON is built on two structures:

A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.

https://www.json.org/json-en.html

cmdlineluser commented 3 months ago

It is valid JSON, but it is problematic.

Pandas appears to explode the list and then "unnest" the values.

>>> pd.read_json(StringIO(example_json))
#                       0      1   2   3   4  5
# 0  2024-05-15T23:59:00Z -0.512 NaN NaN NaN  a
# 1  2024-05-15T23:59:01Z -0.500 NaN NaN NaN  a
# 2  2024-05-15T23:59:00Z  0.400 NaN NaN NaN  b

If we use .str.json_decode() - we can see it is parsed as valid:

>>> pl.select(pl.lit(example_json).str.json_decode())
# shape: (1, 1)
# ┌─────────────────────────────────┐
# │ literal                         │
# │ ---                             │
# │ list[list[str]]                 │
# ╞═════════════════════════════════╡
# │ [["2024-05-15T23:59:00Z", "-0.… │
# └─────────────────────────────────┘

But as lists are homogeneous in Polars, everything has been coerced into Strings.

>>> (pl.select(pl.lit(example_json).str.json_decode().flatten())
...    .select(cols = pl.all().list.to_struct("max_width"))
...    .unnest("cols")
... )
# shape: (3, 6)
# ┌──────────────────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
# │ field_0              ┆ field_1 ┆ field_2 ┆ field_3 ┆ field_4 ┆ field_5 │
# │ ---                  ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
# │ str                  ┆ str     ┆ str     ┆ str     ┆ str     ┆ str     │
# ╞══════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
# │ 2024-05-15T23:59:00Z ┆ -0.512  ┆ null    ┆ null    ┆ null    ┆ a       │
# │ 2024-05-15T23:59:01Z ┆ -0.5    ┆ null    ┆ null    ┆ null    ┆ a       │
# │ 2024-05-15T23:59:00Z ┆ 0.4     ┆ null    ┆ null    ┆ null    ┆ b       │
# └──────────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

It seems that when reading from a file, only JSON objects are considered valid?