pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Regression 0.20.15->0.20.16: ComputeError: conversion from `null` to `struct[100]` failed in column 'literal' for 0 out of 1 values #15476

Open antonioalegria opened 7 months ago

antonioalegria commented 7 months ago

Reproducible example

import polars as pl

def _unnest_list_columns(df, list_columns):
    new_columns = []
    for col in list_columns:
        new_column = pl.when((pl.col(col).is_not_null()) & (pl.col(col).list.len() > 0)).then(pl.col(col).list.to_struct("max_width", lambda x: f"{x}", 100)).otherwise(pl.lit(None)).alias(col) # This doesn't work with empty lists
        new_columns.append(new_column)

    return df.with_columns(new_columns)

df1 = pl.DataFrame(
    {"a": [1, 2, 3],
     "b": [[{"a": 1}], [{"a": 1}, {"a": 2}], [{"a": 1}, {"a": 2}, {"a": 3}]]
     }
    )

print(_unnest_list_columns(df1, ["b"])) # ComputeError: conversion from `null` to `struct[100]` failed in column 'literal' for 0 out of 1 values

Log output

Traceback (most recent call last):
  File "/Users/antonioalegria/Developer/hyperml/x.py", line 17, in <module>
    _unnest_list_columns(df1, ["b"]) # Boom!
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/antonioalegria/Developer/hyperml/x.py", line 9, in _unnest_list_columns
    return df.with_columns(new_columns)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/polars/dataframe/frame.py", line 8366, in with_columns
    return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1943, in collect
    return wrap_df(ldf.collect())
                   ^^^^^^^^^^^^^
polars.exceptions.ComputeError: conversion from `null` to `struct[100]` failed in column 'literal' for 0 out of 1 values: []

Issue description

In 0.20.15 this ran without any issues; in 0.20.16 it raises the exception shown above.

Expected behavior

It should run as it did in 0.20.15 (unless I need to migrate some code) and print the following:

shape: (3, 2)
┌─────┬─────────────────────┐
│ a   ┆ b                   │
│ --- ┆ ---                 │
│ i64 ┆ struct[3]           │
╞═════╪═════════════════════╡
│ 1   ┆ {{1},{null},{null}} │
│ 2   ┆ {{1},{2},{null}}    │
│ 3   ┆ {{1},{2},{3}}       │
└─────┴─────────────────────┘

Installed versions

```
--------Version info---------
Polars:               0.20.16
Index type:           UInt32
Platform:             macOS-14.3.1-arm64-arm-64bit
Python:               3.11.6 (main, Oct 2 2023, 20:46:14) [Clang 14.0.3 (clang-1403.0.22.14.1)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          2.2.1
connectorx:
deltalake:
fastexcel:
fsspec:               2023.6.0
gevent:
hvplot:
matplotlib:           3.7.1
numpy:                1.24.3
openpyxl:             3.1.2
pandas:               1.5.3
pyarrow:              12.0.1
pydantic:             1.10.9
pyiceberg:
pyxlsb:
sqlalchemy:           2.0.18
xlsx2csv:
xlsxwriter:
```

cmdlineluser commented 7 months ago

Can reproduce the error.

On 0.20.15 I get this:

df1 = pl.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [[{"a": 1}], [{"a": 1}, {"a": 2}], [{"a": 1}, {"a": 2}, {"a": 3}], [], None]
})

pl.__version__

df1.with_columns(
    pl.when((pl.col("b").is_not_null()) & (pl.col("b").list.len() > 0))
      .then(pl.col("b").list.to_struct("max_width", lambda x: f"{x}", 100))
)

# '0.20.15'
# shape: (5, 2)
# ┌─────┬────────────────────────┐
# │ a   ┆ b                      │
# │ --- ┆ ---                    │
# │ i64 ┆ struct[3]              │
# ╞═════╪════════════════════════╡
# │ 1   ┆ {{1},{null},{null}}    │
# │ 2   ┆ {{1},{2},{null}}       │
# │ 3   ┆ {{1},{2},{3}}          │
# │ 4   ┆ {{null},{null},{null}} │
# │ 5   ┆ {{null},{null},{null}} │
# └─────┴────────────────────────┘

Does the .when() actually do anything in this case?

df1.with_columns(
    pl.col("b").list.to_struct("max_width", lambda x: f"{x}", 100)
)

# shape: (5, 2)
# ┌─────┬────────────────────────┐
# │ a   ┆ b                      │
# │ --- ┆ ---                    │
# │ i64 ┆ struct[3]              │
# ╞═════╪════════════════════════╡
# │ 1   ┆ {{1},{null},{null}}    │
# │ 2   ┆ {{1},{2},{null}}       │
# │ 3   ┆ {{1},{2},{3}}          │
# │ 4   ┆ {{null},{null},{null}} │
# │ 5   ┆ {{null},{null},{null}} │
# └─────┴────────────────────────┘

reswqa commented 7 months ago

Thanks @antonioalegria and @cmdlineluser. This has probably been a latent issue for some time; it only surfaced in 0.20.16, when type_coercion for when-then-otherwise was changed to use strict_cast. But yes, we should fix this.

reswqa commented 7 months ago

After some discussion, I think this should be fixed if we enable outer validity for StructChunked, see #3462.

Until then, you may need to set type_coercion=False as a workaround.

antonioalegria commented 6 months ago

Where should I set type_coercion=False?