pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

SchemaError when accessing list in group_by's agg #17455

Closed raayu83 closed 2 months ago

raayu83 commented 3 months ago

Reproducible example

# this works
import polars as pl

df = (
    pl.DataFrame({"label_1": ["a", "a"], "label_2": ["a", "b"]})
    .group_by(pl.col("label_1"))
    .agg(pl.col("label_2"))
    .select(pl.col("label_1"), pl.col("label_2").list.join(", "))
)
print(df)

# this doesn't work
df = (
    pl.DataFrame({"label_1": ["a", "a"], "label_2": ["a", "b"]})
    .group_by(pl.col("label_1"))
    .agg(pl.col("label_2").list.join(", "))
)
print(df)

Log output

keys/aggregates are not partitionable: running default HASH AGGREGATION
keys/aggregates are not partitionable: running default HASH AGGREGATION
Traceback (most recent call last):
  File "/Users/klst/pycharm/polars-test/example.py", line 12, in <module>
    pl.DataFrame({"label_1": ["a", "a"], "label_2": ["a", "b"]})
  File "/Users/klst/Library/Caches/pypoetry/virtualenvs/polars-test-tNQSEy4z-py3.9/lib/python3.9/site-packages/polars/dataframe/group_by.py", line 227, in agg
    self.df.lazy()
  File "/Users/klst/Library/Caches/pypoetry/virtualenvs/polars-test-tNQSEy4z-py3.9/lib/python3.9/site-packages/polars/lazyframe/frame.py", line 1942, in collect
    return wrap_df(ldf.collect(callback))
polars.exceptions.SchemaError: invalid series dtype: expected `List`, got `str`

Issue description

When trying to access a list generated inside agg, a SchemaError is raised: polars.exceptions.SchemaError: invalid series dtype: expected `List`, got `str`

Accessing the list in a separate select works, but is more verbose than necessary.

Expected behavior

Accessing the list should be possible without any error. For example, the output on the command line would be:

shape: (1, 2)
┌─────────┬─────────┐
│ label_1 ┆ label_2 │
│ ---     ┆ ---     │
│ str     ┆ str     │
╞═════════╪═════════╡
│ a       ┆ a, b    │
└─────────┴─────────┘

Installed versions

```
--------Version info---------
Polars: 1.0.0
Index type: UInt32
Platform: macOS-14.5-arm64-arm-64bit
Python: 3.9.6 (default, Feb 3 2024, 15:58:27) [Clang 15.0.0 (clang-1500.3.9.4)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
great_tables:
hvplot:
matplotlib:
nest_asyncio:
numpy:
openpyxl:
pandas:
pyarrow:
pydantic:
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```

cmdlineluser commented 3 months ago

You don't have a list yet at that point:

(
    pl.DataFrame({"label_1": ["a", "a"], "label_2": ["a", "b"]})
    .group_by(pl.col("label_1"))
    .agg(pl.col("label_2").str.join(", "))
)
shape: (1, 2)
┌─────────┬─────────┐
│ label_1 ┆ label_2 │
│ ---     ┆ ---     │
│ str     ┆ str     │
╞═════════╪═════════╡
│ a       ┆ a, b    │
└─────────┴─────────┘
raayu83 commented 2 months ago

Hi @cmdlineluser, could you please explain a bit further? Shouldn't it be a list at that point?

etiennebacher commented 2 months ago

It is a list:

(
    pl.DataFrame({"label_1": ["a", "a"], "label_2": ["a", "b"]})
    .group_by(pl.col("label_1"))
    .agg(pl.col("label_2"))
)
shape: (1, 2)
┌─────────┬────────────┐
│ label_1 ┆ label_2    │
│ ---     ┆ ---        │
│ str     ┆ list[str]  │
╞═════════╪════════════╡
│ a       ┆ ["a", "b"] │
└─────────┴────────────┘

But inside .agg() this is not taken into account. For instance, in the documentation examples for .agg(), you can see that .agg(pl.col('b')) produces a list of i64, yet you apply .sum() to it directly, not .list.sum().

cmdlineluser commented 2 months ago

The .agg() example produces a list but inside .agg() you do not yet have a list type.

import polars as pl

df = pl.DataFrame({"id": ["a", "a", "b"], "value": ["x", "y", "z"]})

df.group_by("id").map_groups(lambda x:
    [print(x), x][-1]
)
shape: (2, 2)
┌─────┬───────┐
│ id  ┆ value │
│ --- ┆ ---   │
│ str ┆ str   │ # <- str
╞═════╪═══════╡
│ a   ┆ x     │
│ a   ┆ y     │
└─────┴───────┘

shape: (1, 2)
┌─────┬───────┐
│ id  ┆ value │
│ --- ┆ ---   │
│ str ┆ str   │ # <- str
╞═════╪═══════╡
│ b   ┆ z     │
└─────┴───────┘
raayu83 commented 2 months ago

Is that how it is supposed to be? I'd have expected it to be a list by that time.

raayu83 commented 2 months ago

@cmdlineluser Please let me ask again: is that how you want it to be (it not being a list type at this point)? From a UX perspective it would be better if it were. I don't know if there are any technical reasons it can't be, though.

cmdlineluser commented 2 months ago

How I understand things is that a group_by operation is essentially a way to process specific slices of a dataframe.

Inside .agg() you are processing each "slice", so you still have the same initial column type:

df.slice(0, 2)
# shape: (2, 2)
# ┌─────┬───────┐
# │ id  ┆ value │ # Group a
# │ --- ┆ ---   │
# │ str ┆ str   │
# ╞═════╪═══════╡
# │ a   ┆ x     │
# │ a   ┆ y     │
# └─────┴───────┘
df.slice(2)
# shape: (1, 2)
# ┌─────┬───────┐
# │ id  ┆ value │ # Group b
# │ --- ┆ ---   │
# │ str ┆ str   │
# ╞═════╪═══════╡
# │ b   ┆ z     │
# └─────┴───────┘

The results of each slice are then accumulated into a list (or not, depending on the exact operation performed).

raayu83 commented 2 months ago

Thanks, now I think I understand what you mean. So basically inside sum I would first need to call a function that aggregates everything into a list and then I can use the list.

raayu83 commented 2 months ago

Hm but the following outputs a list[str] instead of a joined str:

import polars as pl

df = (
    pl.DataFrame({"label_1": ["a", "a"], "label_2": ["a", "b"]})
    .group_by(pl.col("label_1"))
    .agg(pl.col("label_2").implode().list.join(", "))
)
print(df)

Does that mean passing further Expressions to the aggregation is currently impossible because inside sum() the aggregation isn't done yet? But we are already passing an aggregation function, so could this maybe be changed? What would you think about that?

cmdlineluser commented 2 months ago

Does that mean passing further Expressions to the aggregation is currently impossible because inside sum() the aggregation isn't done yet?

There does not seem to be any sum() examples anywhere, so I'm not entirely sure what is being asked here.

but the following outputs a list[str] instead of a joined str

There is some extra information in this PR by the Polars author: https://github.com/pola-rs/polars/pull/6487

If we update the example there to current syntax:

df = pl.DataFrame({
    "group": [1, 2, 2, 3], 
    "value": ["a", "b", "c", "d"]
})

(
    df.group_by("group")
    .agg(
        pl.col("value").alias("values in groups"),
        pl.col("value").implode().alias("values in groups + implode"),
        pl.col("value").implode().list.join("-").alias("values in groups + list.join"),
        pl.col("value").str.join("-").alias("str.join reducer single item")
    )
)
# shape: (3, 5)
# ┌───────┬──────────────────┬────────────────────────────┬──────────────────────────────┬──────────────────────────────┐
# │ group ┆ values in groups ┆ values in groups + implode ┆ values in groups + list.join ┆ str.join reducer single item │
# │ ---   ┆ ---              ┆ ---                        ┆ ---                          ┆ ---                          │
# │ i64   ┆ list[str]        ┆ list[list[str]]            ┆ list[str]                    ┆ str                          │
# ╞═══════╪══════════════════╪════════════════════════════╪══════════════════════════════╪══════════════════════════════╡
# │ 3     ┆ ["d"]            ┆ [["d"]]                    ┆ ["d"]                        ┆ d                            │
# │ 1     ┆ ["a"]            ┆ [["a"]]                    ┆ ["a"]                        ┆ a                            │
# │ 2     ┆ ["b", "c"]       ┆ [["b", "c"]]               ┆ ["b-c"]                      ┆ b-c                          │
# └───────┴──────────────────┴────────────────────────────┴──────────────────────────────┴──────────────────────────────┘

My understanding is that .str.join() is a "reducer" (i.e. like .sum()) which gives you a single element.

By calling .implode() you introduce your own list value, which means you will still have the "outer list" representing the "values in groups".

raayu83 commented 2 months ago

Hi @cmdlineluser,

thanks for the explanations!

What I was looking to achieve is the "reducer single item" above. So it was already possible and you don't even have to use implode.

This shows me I still need to learn a lot about Polars. Looking forward to the books that have been announced.

PierXuY commented 1 month ago

Does that mean passing further Expressions to the aggregation is currently impossible because inside sum() the aggregation isn't done yet?

There does not seem to be any sum() examples anywhere, so I'm not entirely sure what is being asked here.

but the following outputs a list[str] instead of a joined str

There is some extra information in this PR by the Polars author: #6487

  • (note: .implode() was previously called .list() at the time of that writing)

Why is the result type of pl.col("value").implode().list.join("-").alias("values in groups + list.join") list[str] instead of str?