pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Cannot add columns to an empty DataFrame in 1.7.0 #18700

Closed hanjinliu closed 1 month ago

hanjinliu commented 1 month ago

Reproducible example

import polars as pl

df = pl.DataFrame([])  # empty frame: no columns, height 0
df.with_columns(pl.Series("new", [1]))  # raises InvalidOperationError in 1.7.0

Output:

Cell In[1], line 2
      1 df = pl.DataFrame([])
----> 2 df.with_columns(pl.Series("new", [1]))

File ~\mambaforge\envs\mt\Lib\site-packages\polars\dataframe\frame.py:9141, in DataFrame.with_columns(self, *exprs, **named_exprs)
   8995 def with_columns(
   8996     self,
   8997     *exprs: IntoExpr | Iterable[IntoExpr],
   8998     **named_exprs: IntoExpr,
   8999 ) -> DataFrame:
   9000     """
   9001     Add columns to this DataFrame.
   9002 
   (...)
   9139     └─────┴──────┴─────────────┘
   9140     """
-> 9141     return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)

File ~\mambaforge\envs\mt\Lib\site-packages\polars\lazyframe\frame.py:2034, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2032 # Only for testing purposes
   2033 callback = _kwargs.get("post_opt_callback", callback)
-> 2034 return wrap_df(ldf.collect(callback))

InvalidOperationError: Series new, length 1 doesn't match the DataFrame height of 0

If you want this Series to be broadcasted, ensure it is a scalar (for instance by adding '.first()').

Log output

No response

Issue description

Probably related to #18686, as it raises the same error. The same code worked with polars=1.6.0 but fails with polars=1.7.0.

Expected behavior

df.with_columns(pl.Series("new", [1])) should return a DataFrame with a single row/column.

Installed versions

```
--------Version info---------
Polars:              1.7.0
Index type:          UInt32
Platform:            Windows-10-10.0.22631-SP0
Python:              3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:27:10) [MSC v.1938 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle          3.0.0
connectorx
deltalake
fastexcel            0.10.4
fsspec               2024.5.0
gevent
great_tables
matplotlib           3.9.0
nest_asyncio         1.6.0
numpy                2.1.0
openpyxl
pandas               2.2.2
pyarrow              16.1.0
pydantic             1.10.16
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter
```

ritchie46 commented 1 month ago

This behavior is correct. The DataFrame has height 0, so you cannot add a column of length 1.

Only literals/scalars may differ in length from the DataFrame; they are broadcast to the DataFrame's height. (There is currently a bug for empty frames, which we will fix soon.)

eitsupi commented 1 month ago

More odd behavior observed, see #18736.