Closed raayu83 closed 2 months ago
You don't have a list yet at that point:
(
pl.DataFrame({"label_1": ["a", "a"], "label_2": ["a", "b"]})
.group_by(pl.col("label_1"))
.agg(pl.col("label_2").str.join(", "))
)
shape: (1, 2)
┌─────────┬─────────┐
│ label_1 ┆ label_2 │
│ --- ┆ --- │
│ str ┆ str │
╞═════════╪═════════╡
│ a ┆ a, b │
└─────────┴─────────┘
Hi @cmdlineluser, could you please explain a bit further? Shouldn't it be a list at that point?
It is a list:
(
pl.DataFrame({"label_1": ["a", "a"], "label_2": ["a", "b"]})
.group_by(pl.col("label_1"))
.agg(pl.col("label_2"))
)
shape: (1, 2)
┌─────────┬────────────┐
│ label_1 ┆ label_2 │
│ --- ┆ --- │
│ str ┆ list[str] │
╞═════════╪════════════╡
│ a ┆ ["a", "b"] │
└─────────┴────────────┘
But in .agg() this is not taken into account. For example, in the examples for .agg(), you can see that .agg(pl.col('b')) is a list of i64, but you can just apply .sum() on it, not .list.sum().
The .agg() example produces a list, but inside .agg() you do not yet have a list type.
df = pl.DataFrame({"id": ["a", "a", "b"], "value": ["x", "y", "z"]})
df.group_by("id").map_groups(lambda x:
[print(x), x][-1]
)
shape: (2, 2)
┌─────┬───────┐
│ id ┆ value │
│ --- ┆ --- │
│ str ┆ str │ # <- str
╞═════╪═══════╡
│ a ┆ x │
│ a ┆ y │
└─────┴───────┘
shape: (1, 2)
┌─────┬───────┐
│ id ┆ value │
│ --- ┆ --- │
│ str ┆ str │ # <- str
╞═════╪═══════╡
│ b ┆ z │
└─────┴───────┘
Is that how it is supposed to be? I'd have expected it to be a list by that time.
@cmdlineluser Please let me ask again: is that how you want it to be (it not being a list type at this point)? From a UX perspective it would be better if it was. I don't know if there are any technical reasons it can't be, though.
How I understand things is that a group_by operation is essentially a way to process specific slices of a dataframe.
Inside .agg() you are processing each "slice", so you still have the same initial column type:
df.slice(0, 2)
# shape: (2, 2)
# ┌─────┬───────┐
# │ id ┆ value │ # Group a
# │ --- ┆ --- │
# │ str ┆ str │
# ╞═════╪═══════╡
# │ a ┆ x │
# │ a ┆ y │
# └─────┴───────┘
df.slice(2)
# shape: (1, 2)
# ┌─────┬───────┐
# │ id ┆ value │ # Group b
# │ --- ┆ --- │
# │ str ┆ str │
# ╞═════╪═══════╡
# │ b ┆ z │
# └─────┴───────┘
The results are then accumulated into a list (or not, depending on the exact operation performed).
Thanks, now I think I understand what you mean. So basically inside sum I would first need to call a function that aggregates everything into a list and then I can use the list.
Hm but the following outputs a list[str] instead of a joined str:
import polars as pl
df = (
pl.DataFrame({"label_1": ["a", "a"], "label_2": ["a", "b"]})
.group_by(pl.col("label_1"))
.agg(pl.col("label_2").implode().list.join(", "))
)
print(df)
Does that mean passing further Expressions to the aggregation is currently impossible because inside sum() the aggregation isn't done yet? But we are already passing an aggregation function, so could this maybe be changed? What would you think about that?
Does that mean passing further Expressions to the aggregation is currently impossible because inside sum() the aggregation isn't done yet?
There do not seem to be any sum() examples anywhere, so I'm not entirely sure what is being asked here.
but the following outputs a list[str] instead of a joined str
There is some extra information in this PR by the Polars author: https://github.com/pola-rs/polars/pull/6487
(note: .implode() was previously called .list() at the time of that writing)
If we update the example there to current syntax:
df = pl.DataFrame({
"group": [1, 2, 2, 3],
"value": ["a", "b", "c", "d"]
})
(
df.group_by("group")
.agg(
pl.col("value").alias("values in groups"),
pl.col("value").implode().alias("values in groups + implode"),
pl.col("value").implode().list.join("-").alias("values in groups + list.join"),
pl.col("value").str.join("-").alias("str.join reducer single item")
)
)
# shape: (3, 5)
# ┌───────┬──────────────────┬────────────────────────────┬──────────────────────────────┬──────────────────────────────┐
# │ group ┆ values in groups ┆ values in groups + implode ┆ values in groups + list.join ┆ str.join reducer single item │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ list[str] ┆ list[list[str]] ┆ list[str] ┆ str │
# ╞═══════╪══════════════════╪════════════════════════════╪══════════════════════════════╪══════════════════════════════╡
# │ 3 ┆ ["d"] ┆ [["d"]] ┆ ["d"] ┆ d │
# │ 1 ┆ ["a"] ┆ [["a"]] ┆ ["a"] ┆ a │
# │ 2 ┆ ["b", "c"] ┆ [["b", "c"]] ┆ ["b-c"] ┆ b-c │
# └───────┴──────────────────┴────────────────────────────┴──────────────────────────────┴──────────────────────────────┘
My understanding is that .str.join() is a "reducer" (i.e. like .sum()) which gives you a single element.
By calling .implode() you introduce your own list value, which means you will still have the "outer list" representing the "values in groups".
Hi @cmdlineluser ,
thanks for the explanations!
What I was looking to achieve is the "reducer single item" above. So it was already possible and you don't even have to use implode.
Shows me I still need to learn a lot about polars.... Looking forward to the books that have been announced.
Why is the result type of pl.col("value").implode().list.join("-").alias("values in groups + list.join") list[str] instead of str?
Checks
Reproducible example
Log output
Issue description
When trying to access a list generated inside agg, a SchemaError is raised: polars.exceptions.SchemaError: invalid series dtype: expected List, got str
Accessing the list in a separate select works, but is more verbose than necessary.
Expected behavior
Accessing the list should be possible without any error. Output on the command line, for example:
Installed versions