List dtype error with `list.to_array` in LazyFrame

Wouittone commented 4 months ago

Checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

(
    pl.LazyFrame({"id": [1, 1, 2, 2, 2, 3, 3, 3, 3], "values": [1, 2, 3, 4, 5, 6, 7, 8, 9]})
      .group_by("id")
      .agg(pl.concat([pl.col("values")]).alias("lists"))
      .select(pl.col("lists").list.gather([0, 1]).list.to_array(2).alias("lists_start"))
      .collect()
)

Log output

``` --------------------------------------------------------------------------- ComputeError Traceback (most recent call last) Cell In[79], line 6 1 ( 2 pl.LazyFrame({"id": [1, 1, 2, 2, 2, 3, 3, 3, 3], "values": [1, 2, 3, 4, 5, 6, 7, 8, 9]}) 3 .group_by("id") 4 .agg(pl.concat([pl.col("values")]).alias("lists")) 5 .select(pl.col("lists").list.gather([0, 1]).list.to_array(2).alias("lists_start")) ----> 6 .collect() 7 ) File ~/.micromamba/envs/myenv/lib/python3.12/site-packages/polars/lazyframe/frame.py:1934, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, background, _eager) 1931 if background: 1932 return InProcessQuery(ldf.collect_concurrently()) -> 1934 return wrap_df(ldf.collect()) ComputeError: expected List dtype Error originated just after this operation: AGGREGATE [col("values").concat_expr().alias("lists")] BY [col("id")] FROM DF ["id", "values"]; PROJECT */2 COLUMNS; SELECTION: "None" ```

Issue description

This is a strange error where applying .list.to_array in a LazyFrame after a grouping operation that concatenates some values into a list seems to wrongly infer the column type (at least that is what I understand from the ComputeError: expected List dtype error message.)

Running the example without the array cast works and the column dtype is a pl.List:

(
    pl.LazyFrame({"id": [1, 1, 2, 2, 2, 3, 3, 3, 3], "values": [1, 2, 3, 4, 5, 6, 7, 8, 9]})
      .group_by("id")
      .agg(pl.concat([pl.col("values")]).alias("lists"))
      .select(pl.col("lists").list.gather([0, 1]).alias("lists_start"))
      .collect()
)

# shape: (3, 1)
# ┌─────────────┐
# │ lists_start │
# │ ---         │
# │ list[i64]   │
# ╞═════════════╡
# │ [1, 2]      │
# │ [6, 7]      │
# │ [3, 4]      │
# └─────────────┘

Expected behavior

This seems to work fine for eager DataFrame:

(
    pl.DataFrame({"id": [1, 1, 2, 2, 2, 3, 3, 3, 3], "values": [1, 2, 3, 4, 5, 6, 7, 8, 9]})
      .group_by("id")
      .agg(pl.concat([pl.col("values")]).alias("lists"))
      .select(pl.col("lists").list.gather([0, 1]).list.to_array(2).alias("lists_start"))
)

And also for a LazyFrame without a groupby:

(
    pl.LazyFrame({"id": [1, 2, 3], "lists": [[1, 2], [3, 4, 5], [6, 7, 8, 9]]})
      .select(pl.col("lists").list.gather([0, 1]).list.to_array(2).alias("lists_start"))
      .collect()
)

I was expecting the LazyFrame with a grouby operation to work just as those two.

Installed versions

``` --------Version info--------- Polars: 0.20.19 Index type: UInt32 Platform: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.39 Python: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0] ----Optional dependencies---- adbc_driver_manager: cloudpickle: connectorx: deltalake: fastexcel: fsspec: 2024.2.0 gevent: hvplot: 0.9.2 matplotlib: 3.8.3 nest_asyncio: 1.6.0 numpy: 1.26.4 openpyxl: pandas: 2.2.1 pyarrow: 15.0.1 pydantic: 2.6.3 pyiceberg: pyxlsb: sqlalchemy: 2.0.28 xlsx2csv: xlsxwriter: ```

reswqa commented 4 months ago

The key point is that the schema of FunctionExpr in agg seems doesn't has a list type(in schema).

pl.LazyFrame({"a": [1,None], "b": [1,2]})
        .group_by("b").agg(pl.col("a").drop_nulls())
        .schema

OrderedDict([('b', Int64), ('a', Int64)])

@ritchie46 should have some insight.

ritchie46 commented 4 months ago

I think function expr should call the to_field with Context::Aggregation. Not sure, but the schema is incorrect.

reswqa commented 4 months ago

I think function expr should call the to_field with Context::Aggregation.

You mean when function get its input field? If we had agg(pl.col("a").foo()), the input to foo() would be a list in Context::Aggregation, which also feels problematic.

Maybe FunctionExpr should handle Context in function.get_field (right now they are completely ignored). The only thing that bothering me is the return_scalar=True case, which is somewhat propagate at the moment. For instance, agg(pl.col("a").sum().pow(2))) is not nested even though pow does not return_scalar itself, so we can't simply wrap a list on its schema.

Wouittone commented 3 months ago

I was about to start working on this a bit further, but I now realize that I probably shouldn't have used a concat in the aggregation step.

I just tried running the example again without it and it works just as expected:

(
    pl.LazyFrame({"id": [1, 1, 2, 2, 2, 3, 3, 3, 3], "values": [1, 2, 3, 4, 5, 6, 7, 8, 9]})
      .group_by("id")
      .agg("values")
      .select(pl.col("values").list.gather([0, 1]).list.to_array(2).alias("lists_start"))
      .collect()
)

>>> shape: (3, 1)
┌───────────────┐
│ lists_start   │
│ ---           │
│ array[i64, 2] │
╞═══════════════╡
│ [6, 7]        │
│ [1, 2]        │
│ [3, 4]        │
└───────────────┘

I personally feel that it would be fine to leave it as is, and maybe improve the documentation on this point if you think this is relevant? I'd be happy to give it a shot, if so.

Wouittone commented 3 months ago

(Sorry for the misclick... :disappointed:)

pola-rs / polars