Open NSchrading opened 3 months ago
I don't think that's valid JSON... :thinking: The error messages definitely aren't great here though.
I wouldn't expect Python's built-in json parser or pandas to parse it if it weren't valid JSON:
>>> import json
>>> example_json = """
... [
... ["2024-05-15T23:59:00Z",-0.512,null,null,null,"a"],
... ["2024-05-15T23:59:01Z",-0.5,null,null,null,"a"],
... ["2024-05-15T23:59:00Z",0.4,null,null,null,"b"]
... ]
... """
>>>
>>> json.loads(example_json)
[['2024-05-15T23:59:00Z', -0.512, None, None, None, 'a'], ['2024-05-15T23:59:01Z', -0.5, None, None, None, 'a'], ['2024-05-15T23:59:00Z', 0.4, None, None, None, 'b']]
It also passes online JSON linters/validators, e.g. https://jsonlint.com/. I believe that, according to the JSON spec, these are valid JSON arrays:
JSON is built on two structures:
A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.
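To make the two structures concrete, here is a minimal sketch using only the standard library. The two sample documents are made up for illustration; the point is that a top-level array and a top-level object are both accepted by a conforming parser.

```python
import json

# Illustrative documents (not taken from the issue): one top-level array
# ("ordered list of values") and one top-level object ("name/value pairs").
array_doc = '[["2024-05-15T23:59:00Z", -0.512, null, "a"]]'
object_doc = '{"time": "2024-05-15T23:59:00Z", "value": -0.512}'

parsed_array = json.loads(array_doc)
parsed_object = json.loads(object_doc)

# A JSON array maps to a Python list, a JSON object to a dict,
# and JSON null to None.
print(type(parsed_array).__name__)   # list
print(type(parsed_object).__name__)  # dict
print(parsed_array[0][2])            # None
```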
It is valid JSON, but it is problematic for Polars: each inner list mixes strings, numbers, and nulls.
Pandas appears to explode the list and then "unnest" the values.
>>> from io import StringIO
>>> import pandas as pd
>>> pd.read_json(StringIO(example_json))
# 0 1 2 3 4 5
# 0 2024-05-15T23:59:00Z -0.512 NaN NaN NaN a
# 1 2024-05-15T23:59:01Z -0.500 NaN NaN NaN a
# 2 2024-05-15T23:59:00Z 0.400 NaN NaN NaN b
If we use .str.json_decode(), we can see it is parsed as valid:
>>> import polars as pl
>>> pl.select(pl.lit(example_json).str.json_decode())
# shape: (1, 1)
# ┌─────────────────────────────────┐
# │ literal │
# │ --- │
# │ list[list[str]] │
# ╞═════════════════════════════════╡
# │ [["2024-05-15T23:59:00Z", "-0.… │
# └─────────────────────────────────┘
But as lists are homogeneous in Polars, everything has been coerced into Strings.
>>> (pl.select(pl.lit(example_json).str.json_decode().flatten())
... .select(cols = pl.all().list.to_struct("max_width"))
... .unnest("cols")
... )
# shape: (3, 6)
# ┌──────────────────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
# │ field_0 ┆ field_1 ┆ field_2 ┆ field_3 ┆ field_4 ┆ field_5 │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str ┆ str ┆ str ┆ str │
# ╞══════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
# │ 2024-05-15T23:59:00Z ┆ -0.512 ┆ null ┆ null ┆ null ┆ a │
# │ 2024-05-15T23:59:01Z ┆ -0.5 ┆ null ┆ null ┆ null ┆ a │
# │ 2024-05-15T23:59:00Z ┆ 0.4 ┆ null ┆ null ┆ null ┆ b │
# └──────────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
It seems that when reading from a file, only JSON objects are considered valid?
Checks
Reproducible example
Log output
No response
Issue description
Polars is unable to parse a valid JSON list of lists. See the reproducible examples. It also doesn't support a schema given as a list of column names, despite the documentation saying so: https://docs.pola.rs/api/python/stable/reference/api/polars.read_json.html
Expected behavior
Expected behavior is something like pandas:
If provided a schema with column names they would be set appropriately instead of 0-5.
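Until that works, one possible workaround is to reshape the rows into a list of objects before handing them to a DataFrame library. This is only a sketch using the standard library: the column names and the `rows_to_records` helper are hypothetical, chosen for illustration, and are not part of Polars or pandas.

```python
import json

example_json = """
[
["2024-05-15T23:59:00Z",-0.512,null,null,null,"a"],
["2024-05-15T23:59:01Z",-0.5,null,null,null,"a"],
["2024-05-15T23:59:00Z",0.4,null,null,null,"b"]
]
"""

# Hypothetical column names; the issue does not say what the columns mean.
columns = ["time", "value", "c2", "c3", "c4", "label"]


def rows_to_records(text, names):
    """Turn a JSON list of lists into a list of {name: value} dicts."""
    rows = json.loads(text)
    for i, row in enumerate(rows):
        if len(row) != len(names):
            raise ValueError(
                f"row {i} has {len(row)} values, expected {len(names)}"
            )
    # Pairing names with values keeps each value's own type
    # (str, float, None) instead of coercing the row to one type.
    return [dict(zip(names, row)) for row in rows]


records = rows_to_records(example_json, columns)
records_json = json.dumps(records)  # a JSON array of objects

print(records[0])
# {'time': '2024-05-15T23:59:00Z', 'value': -0.512, 'c2': None, 'c3': None, 'c4': None, 'label': 'a'}
```

The resulting `records` (or the re-encoded `records_json`) is the array-of-objects shape, which I would expect `pl.DataFrame(records)` or `pl.read_json(StringIO(records_json))` to accept, though I have not checked that against every Polars version.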
Installed versions