pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`infer_schema_length=None` fails with unexpected type for nested data #16607

Open theelderbeever opened 6 months ago

theelderbeever commented 6 months ago

Reproducible example

import polars as pl

example_data = [
    {
        "customer": "customer_1",
        "summaries": [
            {
                "id": "summary_1",
                "object": "object_1",
                "aggregated_value": 1000.0,
                "end_time": 1625155200,
                "livemode": True,
                "meter": "meter_1",
                "start_time": 1625078800,
            },
            {
                "id": "summary_2",
                "object": "object_2",
                "aggregated_value": 2000,
                "end_time": 1625241600,
                "livemode": False,
                "meter": "meter_2",
                "start_time": 1625165200,
            }
        ]
    },
    {
        "customer": "customer_2",
        "summaries": [
            {
                "id": "summary_3",
                "object": "object_3",
                "aggregated_value": 3000,
                "end_time": 1625328000,
                "livemode": True,
                "meter": "meter_3",
                "start_time": 1625251600,
            }
        ]
    }
]

pl.DataFrame(example_data, infer_schema_length=None)

Log output

❯ POLARS_VERBOSE=1 python notebooks/test.py
Traceback (most recent call last):
  File "/Users/taylorbeever/git/quiknode-labs/billing/billing-platform-pipelines/notebooks/test.py", line 43, in <module>
    pl.DataFrame(example_data, infer_schema_length=None)
  File "/Users/taylorbeever/.pyenv/versions/billing-platform-pipelines/lib/python3.11/site-packages/polars/dataframe/frame.py", line 366, in __init__
    self._df = sequence_to_pydf(
               ^^^^^^^^^^^^^^^^^
  File "/Users/taylorbeever/.pyenv/versions/billing-platform-pipelines/lib/python3.11/site-packages/polars/_utils/construction/dataframe.py", line 437, in sequence_to_pydf
    return _sequence_to_pydf_dispatcher(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/taylorbeever/.pyenv/versions/3.11.8/lib/python3.11/functools.py", line 909, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/taylorbeever/.pyenv/versions/billing-platform-pipelines/lib/python3.11/site-packages/polars/_utils/construction/dataframe.py", line 678, in _sequence_of_dict_to_pydf
    pydf = PyDataFrame.from_dicts(
           ^^^^^^^^^^^^^^^^^^^^^^^
TypeError: unexpected value while building Series of type Float64; found value of type Int64: 2000

Hint: Try setting `strict=False` to allow passing data with mixed types.

Issue description

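Polars fails to infer the correct datatype for a field of a nested struct, even with `infer_schema_length=None`. The failing field in the example is `aggregated_value` inside the `List(Struct(...))` column `summaries`: the first struct holds a float (`1000.0`), while the later ones hold integers (`2000`, `3000`).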

Expected behavior

`infer_schema_length` should apply to nested types as well, so that `aggregated_value` is inferred as Float64 across all structs.
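
Until nested inference behaves this way, one possible workaround is to normalize the nested field in plain Python before constructing the frame. The sketch below is only an illustration: the helper name `floats_for_field` is hypothetical and not part of Polars, and the sample rows are a trimmed copy of the reproducible example.

```python
from typing import Any


def floats_for_field(
    rows: list[dict[str, Any]], list_key: str, field: str
) -> list[dict[str, Any]]:
    """Return a copy of `rows` in which `field` is a float in every dict
    stored under `list_key`. Hypothetical helper, not a Polars API."""
    normalized = []
    for row in rows:
        structs = []
        for item in row.get(list_key, []):
            item = dict(item)  # shallow copy so the input is not mutated
            value = item.get(field)
            # bool is a subclass of int in Python, so exclude it explicitly
            if isinstance(value, int) and not isinstance(value, bool):
                item[field] = float(value)
            structs.append(item)
        normalized.append({**row, list_key: structs})
    return normalized


rows = [
    {
        "customer": "customer_1",
        "summaries": [
            {"id": "summary_1", "aggregated_value": 1000.0},
            {"id": "summary_2", "aggregated_value": 2000},
        ],
    }
]
normalized = floats_for_field(rows, "summaries", "aggregated_value")
```

The normalized rows could then be passed to `pl.DataFrame(normalized, infer_schema_length=None)`, on the assumption that a uniformly float field no longer trips the strict check.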

Installed versions

```
--------Version info---------
Polars:               0.20.30
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.11.8 (main, Apr 27 2024, 07:50:56) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          2.2.1
connectorx:
deltalake:
fastexcel:
fsspec:               2023.12.2
gevent:
hvplot:               0.9.2
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:
pandas:               2.2.2
pyarrow:
pydantic:             2.5.3
pyiceberg:
pyxlsb:
sqlalchemy:           2.0.29
torch:
xlsx2csv:
xlsxwriter:
```

cmdlineluser commented 5 months ago

Can reproduce.

pl.DataFrame({"A": [[[1.0], [2]]]})
# shape: (1, 1)
# ┌─────────────────┐
# │ A               │
# │ ---             │
# │ list[list[f64]] │
# ╞═════════════════╡
# │ [[1.0], [2.0]]  │
# └─────────────────┘
pl.DataFrame({"A": [[{"B":1.0}, {"B":2}]]})
# TypeError: unexpected value while building Series of type Float64; found value of type Int64: 2

`INFER_SCHEMA_LENGTH` is hardcoded to 25, but it doesn't seem to come into play.

The issue seems to be that structs are treated differently from other types.

For example, inside `to_list` there is an explicit cast, but `to_struct` ends up calling `from_any_values_and_dtype` again on the inner values.

So in this case, we end up with a strict call on the inner values, which fails:

Series::from_any_values_and_dtype("name", [1.0, 2], Float64, true)
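
To make the difference between the strict and non-strict paths concrete, here is a toy model in plain Python. It is not the Polars implementation: `build_float_column` is a hypothetical stand-in that only mimics the behaviour described above for a Float64 target, reusing the error message from the traceback.

```python
def build_float_column(values, strict):
    """Toy stand-in for building a Float64 column from Python values.

    Hypothetical function for illustration only, not a Polars API.
    """
    out = []
    for value in values:
        if isinstance(value, float):
            out.append(value)
        elif isinstance(value, int) and not isinstance(value, bool):
            if strict:
                # Strict mode rejects any value whose type differs from the target.
                raise TypeError(
                    "unexpected value while building Series of type Float64; "
                    f"found value of type Int64: {value}"
                )
            # Non-strict mode casts the value to the target type instead.
            out.append(float(value))
        else:
            raise TypeError(f"unsupported value: {value!r}")
    return out


# Non-strict: the integer is cast, as in the list-of-lists example.
lenient = build_float_column([1.0, 2], strict=False)

# Strict: the same input is rejected, as in the list-of-structs example.
try:
    build_float_column([1.0, 2], strict=True)
    strict_error = None
except TypeError as exc:
    strict_error = str(exc)
```

In this model the list path corresponds to `strict=False` (values are cast to the inferred supertype), while the struct path corresponds to `strict=True` on the inner values.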