pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

ComputeError: this expression cannot run in the group_by context #17356

Open AtollRewe opened 5 days ago

AtollRewe commented 5 days ago

Reproducible example

import polars as pl

df = pl.DataFrame(
    {
        "X": [
            pl.Series(
                "Y",
                [
                    {"a": 1, "b": 2},
                    {"a": 3, "b": 4}
                ]
            ),
        ]
    }
)
df["X"].list.eval(pl.element().struct.with_fields())

Log output

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[17], line 15
      1 os.environ["POLARS_VERBOSE"] = "1"
      2 df = pl.DataFrame(
      3     {
      4         "X": [
   (...)
     13     }
     14 )
---> 15 df["X"].list.eval(pl.element().struct.with_fields(c=pl.element().struct.field("a") + 1))

File ~/git/MOSAIC/ml-pipelines/.venv11/lib/python3.11/site-packages/polars/series/utils.py:107, in call_expr.<locals>.wrapper(self, *args, **kwargs)
    105     expr = getattr(expr, namespace)
    106 f = getattr(expr, func.__name__)
--> 107 return s.to_frame().select_seq(f(*args, **kwargs)).to_series()

File ~/git/MOSAIC/ml-pipelines/.venv11/lib/python3.11/site-packages/polars/dataframe/frame.py:8615, in DataFrame.select_seq(self, *exprs, **named_exprs)
   8592 def select_seq(
   8593     self, *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr
   8594 ) -> DataFrame:
   8595     """
   8596     Select columns from this DataFrame.
   8597 
   (...)
   8613     select
   8614     """
-> 8615     return self.lazy().select_seq(*exprs, **named_exprs).collect(_eager=True)

File ~/git/MOSAIC/ml-pipelines/.venv11/lib/python3.11/site-packages/polars/lazyframe/frame.py:1942, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1939 # Only for testing purposes atm.
   1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))

ComputeError: this expression cannot run in the group_by context

Error originated in expression: 'col("").with_fields([[(col("").struct.field_by_name(a)()) + (1)].alias("c")])'

Issue description

There is no group_by in my user code, so the error message seems cryptic.

The MRE shows a call to with_fields with no arguments, which should be a no-op, but the same error is also raised when arguments are given (as in the log output above).

Expected behavior

In the MRE above, the expected behavior would be the originally created DataFrame (or a copy thereof).

Installed versions

```
--------Version info---------
Polars:               1.0.0
Index type:           UInt32
Platform:             macOS-14.2.1-arm64-arm-64bit
Python:               3.11.7 (v3.11.7:fa7a6f2303, Dec 4 2023, 15:22:56) [Clang 13.0.0 (clang-1300.0.29.30)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:
deltalake:
fastexcel:
fsspec:               2024.6.0
gevent:
great_tables:
hvplot:
matplotlib:           3.9.0
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:
pandas:               2.2.2
pyarrow:              15.0.2
pydantic:             2.7.4
pyiceberg:
sqlalchemy:           2.0.30
torch:
xlsx2csv:
xlsxwriter:
```

cmdlineluser commented 5 days ago

I think list.eval() gets converted into a group_by internally, which is probably where the cryptic error comes from.

It seems pl.field() is also invalid inside list.eval():

df.select(pl.col.X.list.eval(pl.field("a")))
# InvalidOperationError: field expression not allowed at location/context

Perhaps they could both be allowed.