pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Panic during broadcasting struct series containing datetime type #19277

Closed thomasaarholt closed 4 days ago

thomasaarholt commented 4 days ago

Reproducible example

import polars as pl
from datetime import datetime

s_foo = pl.Series("foo", [1,2,]) # height 2
s_bar = pl.Series("bar", [{"datetime": datetime(2024, 1, 1)}]) # struct, height 1

# this worked in polars 1.6:
pl.DataFrame().with_columns([s_foo, s_bar])
# shape: (2, 2)
# ┌─────┬───────────────────────┐
# │ foo ┆ bar                   │
# │ --- ┆ ---                   │
# │ i64 ┆ struct[1]             │
# ╞═════╪═══════════════════════╡
# │ 1   ┆ {2024-01-01 00:00:00} │
# │ 2   ┆ {2024-01-01 00:00:00} │
# └─────┴───────────────────────┘

# error with polars 1.7+
thread '<unnamed>' panicked at crates/polars-core/src/scalar/mod.rs:46:92:
called `Result::unwrap()` on an `Err` value: SchemaMismatch(ErrString("unexpected value while building Series of type Datetime(Microseconds, None); found value of type Datetime(Microseconds, None): 2024-01-01 00:00:00"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
<ipython-input-6-47202b4391ba> in ?()
----> 1 pl.DataFrame().with_columns([s_foo, s_bar2])

~/repos/patito/.venv/lib/python3.12/site-packages/decorator.py in ?(*args, **kw)
    229         def fun(*args, **kw):
    230             if not kwsyntax:
    231                 args, kw = fix(args, kw, sig)
--> 232             return caller(func, *(extras + args), **kw)

~/repos/patito/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py in ?(self)
   1177     def __repr__(self) -> str:
-> 1178         return self.__str__()

~/repos/patito/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py in ?(self)
   1174     def __str__(self) -> str:
-> 1175         return self._df.as_str()

PanicException: called `Result::unwrap()` on an `Err` value: SchemaMismatch(ErrString("unexpected value while building Series of type Datetime(Microseconds, None); found value of type Datetime(Microseconds, None): 2024-01-01 00:00:00"))

Issue description

We discovered this one at patito, which is a dataframe validation library for polars. We have a .examples() method, which constructs single- or multiple-row example series for a given schema, which are then concatenated using with_columns.

Broadcasting a series of length 1 works fine with ints, datetimes, etc., but not with structs containing datetimes, as in the repro above.
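
For readers unfamiliar with the term, here is a minimal pure-Python sketch of the broadcasting rule this report relies on: a length-1 column is repeated until it matches the height of the other columns. The function name `broadcast_columns` and the dict-of-lists layout are hypothetical and only for illustration; this is not how Polars implements broadcasting internally.

```python
from datetime import datetime
from typing import Any


def broadcast_columns(columns: dict[str, list[Any]]) -> dict[str, list[Any]]:
    """Repeat every length-1 column so all columns share one height.

    Illustrative only: this mimics the behaviour the report expects,
    it is not Polars' implementation.
    """
    # The target height is the largest column length (0 if there are no columns).
    height = max((len(values) for values in columns.values()), default=0)
    result: dict[str, list[Any]] = {}
    for name, values in columns.items():
        if len(values) == height:
            result[name] = list(values)
        elif len(values) == 1:
            # A unit-length column is treated like a scalar and repeated.
            result[name] = values * height
        else:
            raise ValueError(
                f"column {name!r} has length {len(values)}, expected 1 or {height}"
            )
    return result


frame = broadcast_columns(
    {
        "foo": [1, 2],  # height 2
        "bar": [{"datetime": datetime(2024, 1, 1)}],  # struct-like, height 1
    }
)
print(frame["bar"])
```

Under this rule the struct-like value in `bar` ends up on both rows, which is the result the repro produced in polars 1.6. A column whose length is neither 1 nor the target height raises an error instead of being broadcast.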

Using other types works fine:

s_foo = pl.Series("foo", [1,2,])
s_bar = pl.Series("bar", [datetime(2024, 1, 1)])
s_baz = pl.Series("baz", [2.0])
pl.DataFrame().with_columns([s_foo, s_bar, s_baz])
# shape: (2, 3)
# ┌─────┬─────────────────────┬─────┐
# │ foo ┆ bar                 ┆ baz │
# │ --- ┆ ---                 ┆ --- │
# │ i64 ┆ datetime[μs]        ┆ f64 │
# ╞═════╪═════════════════════╪═════╡
# │ 1   ┆ 2024-01-01 00:00:00 ┆ 2.0 │
# │ 2   ┆ 2024-01-01 00:00:00 ┆ 2.0 │
# └─────┴─────────────────────┴─────┘

Using a struct of ints and floats works fine:

s_foo = pl.Series("foo", [1,2,])
s_bar = pl.Series("bar", [{"a":1, "b":2.0}])
pl.DataFrame().with_columns([s_foo, s_bar])
# shape: (2, 2)
# ┌─────┬───────────┐
# │ foo ┆ bar       │
# │ --- ┆ ---       │
# │ i64 ┆ struct[2] │
# ╞═════╪═══════════╡
# │ 1   ┆ {1,2.0}   │
# │ 2   ┆ {1,2.0}   │
# └─────┴───────────┘

Broadcasting with structs using with_columns used to work in polars 1.6. For the MWE above, 1.7.1 returns the following error:

---------------------------------------------------------------------------
InvalidOperationError                     Traceback (most recent call last)
Cell In[1], line 8
      5 s_bar = pl.Series("bar", [{"datetime": datetime(2024, 1, 1)}]) # struct, height 1
      7 # this worked in polars 1.6:
----> 8 pl.DataFrame().with_columns([s_foo, s_bar])

File ~/codes/polars-struct/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py:9141, in DataFrame.with_columns(self, *exprs, **named_exprs)
   8995 def with_columns(
   8996     self,
   8997     *exprs: IntoExpr | Iterable[IntoExpr],
   8998     **named_exprs: IntoExpr,
   8999 ) -> DataFrame:
   9000     """
   9001     Add columns to this DataFrame.
   9002
   (...)
   9139     └─────┴──────┴─────────────┘
   9140     """
-> 9141     return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)

File ~/codes/polars-struct/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py:2032, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2030 # Only for testing purposes
   2031 callback = _kwargs.get("post_opt_callback", callback)
-> 2032 return wrap_df(ldf.collect(callback))

InvalidOperationError: Series bar, length 1 doesn't match the DataFrame height of 0

If you want this Series to be broadcasted, ensure it is a scalar (for instance by adding '.first()').

Expected behavior

I'd expect the old behaviour as per the commented out section in the MWE.

Installed versions

```
pl.show_versions()
--------Version info---------
Polars:              1.9.0
Index type:          UInt32
Platform:            macOS-15.0.1-arm64-arm-64bit
Python:              3.12.4 (main, Jul 25 2024, 22:11:22) [Clang 18.1.8 ]

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake
fastexcel
fsspec
gevent
great_tables
matplotlib
nest_asyncio         1.6.0
numpy                2.1.2
openpyxl
pandas               2.2.3
pyarrow              17.0.0
pydantic             2.9.2
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter
```

cmdlineluser commented 4 days ago

This does run again on main:

>>> pl.DataFrame().with_columns([s_foo, s_bar])
shape: (2, 2)
┌─────┬───────────────────────┐
│ foo ┆ bar                   │
│ --- ┆ ---                   │
│ i64 ┆ struct[1]             │
╞═════╪═══════════════════════╡
│ 1   ┆ {2024-01-01 00:00:00} │
│ 2   ┆ {2024-01-01 00:00:00} │
└─────┴───────────────────────┘

It seems it was fixed by https://github.com/pola-rs/polars/pull/19148

thomasaarholt commented 4 days ago

Ah! That’s the best case! Thanks!