Open soerenwolfers opened 8 months ago
Good points @soerenwolfers - I'd be happy to help if you want to make a PR
I would need to be certain about what's going on first, and I think beyond playing with the functions to see what's currently possible I would also need some guidance on what the intentions are.
For example, I don't currently understand why there'd be more than two functions:
1. One with a vectorized parameter that determines whether the supplied function is () -> () (accepts scalar, returns scalar) or (n) -> (n) (accepts series, returns series). That's what I would call map_rows. On expressions, the inputs would either be individual values of a single series (e.g. pl.col('a').map_rows(f, vectorized=False)) or arbitrarily sized (decided by implementation details) chunks of series (e.g. pl.col('a').map_rows(f, vectorized=True)).
2. One whose supplied function is (n)->(n) (accepts series, returns series) in a window context and (n)->() (accepts series, returns scalar) in a groupby context. That's what I would call map_groups. On expressions (e.g., pl.col('a').map_groups(f).over('b') or df.group_by('b').agg(pl.col('a').map_groups(f))) the inputs would be chunks (corresponding to the entire respective window partition or group, or the entire series if applied without context) of a series.
(To avoid having to pack multiple columns into a single struct column, both of these could also be allowed to be applied on entire dataframes (e.g., df.map_rows(f, vectorized=False) or df.map_rows(f, vectorized=True) or df.group_by('b').map_groups(f) or df.map_groups(f).over('b')), but there is a choice to be made then whether to supply dataframes or series-of-row-type in that case, the best decision for which will depend on polars philosophy/implementation details.)
In particular, I would only have two names instead of four.
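To make the two proposed shapes concrete, here is a minimal pure-Python sketch. It is not polars code: the names map_rows, map_groups_window and map_groups_agg are hypothetical stand-ins for the proposal in this comment, plain lists stand in for Series, and the "chunk" handed to a vectorized function is simply the whole input.

```python
from collections import defaultdict


def map_rows(values, f, vectorized=False):
    """Hypothetical pointwise map (a toy model, not a polars function)."""
    if not vectorized:
        # f is () -> (): called once per scalar value.
        return [f(v) for v in values]
    # f is (n) -> (n): called with a chunk; here the chunk is the whole input.
    result = list(f(list(values)))
    if len(result) != len(values):
        raise ValueError("a vectorized function must preserve length")
    return result


def _row_positions_by_key(keys):
    """Collect row positions per key, in order of first appearance."""
    positions = defaultdict(list)
    for position, key in enumerate(keys):
        positions[key].append(position)
    return positions


def map_groups_window(values, keys, f):
    """Window flavour: f is (n) -> (n), results go back to their rows."""
    out = [None] * len(values)
    for positions in _row_positions_by_key(keys).values():
        result = list(f([values[p] for p in positions]))
        if len(result) != len(positions):
            raise ValueError("a window function must preserve group length")
        for p, r in zip(positions, result):
            out[p] = r
    return out


def map_groups_agg(values, keys, f):
    """Group-by flavour: f is (n) -> (), one result per key."""
    return {
        key: f([values[p] for p in positions])
        for key, positions in _row_positions_by_key(keys).items()
    }


group = [1, 2, 3, 1, 2]
value = [1, 2, 3, 4, 5]

doubled_scalar = map_rows(value, lambda x: 2 * x)
doubled_vector = map_rows(value, lambda xs: [2 * x for x in xs], vectorized=True)
offset_in_group = map_groups_window(value, group, lambda xs: [x - min(xs) for x in xs])
sum_per_group = map_groups_agg(value, group, sum)
```

With these toy definitions, both map_rows calls give the same result, while the two map_groups flavours differ only in whether the per-group result keeps the group's length or collapses to one value.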
I can't speak for the devs, but I don't think there's any drive to change the names or API, I think improving the documentation would be enough
Yeah I get that, but I for one cannot improve the documentation because I don't know what's going on.
There is also the top-level pl.map_groups (and map_batches, but no "elements").
.map_elements changes behaviour depending on the context, i.e. it is "map_rows" in select context but inside a groupby context it is the same as .map_groups (without the outer [])?
import json
import polars as pl
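df = pl.DataFrame(dict(
    group=[1, 2, 3, 1, 2],
    value=[1, 2, 3, 4, 5]
))
df.with_columns(
    # Expression-level: the lambda receives the group's values directly.
    elements = pl.col("value").map_elements(lambda x: json.dumps(list(x))).over("group"),
    # Top-level: the lambda receives a list of Series, hence x[0].
    groups = pl.map_groups("value", lambda x: json.dumps(list(x[0]))).over("group")
)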
# shape: (5, 4)
# ┌───────┬───────┬──────────┬────────┐
# │ group ┆ value ┆ elements ┆ groups │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str ┆ str │
# ╞═══════╪═══════╪══════════╪════════╡
# │ 1 ┆ 1 ┆ [1, 4] ┆ [1, 4] │
# │ 2 ┆ 2 ┆ [2, 5] ┆ [2, 5] │
# │ 3 ┆ 3 ┆ [3] ┆ [3] │
# │ 1 ┆ 4 ┆ [1, 4] ┆ [1, 4] │
# │ 2 ┆ 5 ┆ [2, 5] ┆ [2, 5] │
# └───────┴───────┴──────────┴────────┘
Should there not just be Expr.map_rows / Expr.map_groups, with .map_rows disallowed in a groupby context?
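The "outer []" difference in the example above can be modelled without polars. The sketch below is a toy under stated assumptions: expr_map_over and toplevel_map_groups_over are hypothetical helper names, lists stand in for Series, and the only thing being illustrated is the shape of the argument the user function receives (one group slice versus a list of group slices, one per input column).

```python
import json
from collections import defaultdict


def _row_positions_by_key(keys):
    """Collect row positions per key, in order of first appearance."""
    positions = defaultdict(list)
    for position, key in enumerate(keys):
        positions[key].append(position)
    return positions


def expr_map_over(column, keys, f):
    """Toy model of an expression-level map under a window:
    f receives the group's values directly."""
    out = [None] * len(column)
    for positions in _row_positions_by_key(keys).values():
        result = f([column[p] for p in positions])
        for p in positions:
            out[p] = result  # broadcast the group result to every row
    return out


def toplevel_map_groups_over(columns, keys, f):
    """Toy model of a top-level map under a window:
    f receives a list with one group slice per input column."""
    out = [None] * len(keys)
    for positions in _row_positions_by_key(keys).values():
        slices = [[column[p] for p in positions] for column in columns]
        result = f(slices)
        for p in positions:
            out[p] = result
    return out


group = [1, 2, 3, 1, 2]
value = [1, 2, 3, 4, 5]

elements = expr_map_over(value, group, lambda x: json.dumps(x))
groups = toplevel_map_groups_over([value], group, lambda x: json.dumps(x[0]))
```

Both calls produce the same column in this model, which mirrors the table above: the user function does the same work, and only the extra level of nesting in its argument differs.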
On top of this, there seem to be 3 flavors of map_batches, all slightly different:
Why does LazyFrame.map_batches take a DataFrame, while the others take a Series? Nevertheless, it seems that the DataFrame that the function is called with always contains only one column?
Why isn't there a streamable or is_elementwise parameter for polars.map_batches?
According to the documentation, polars.map_batches may be applied to multiple exprs at once, but will the function be called with each series separately, or all at once (as multiple arguments)?
Also, why isn't there a DataFrame.map_batches?
I would concur with the OP: the map_... APIs are in serious need of rationalization.
Description
Summary
The functionality split between map_batches/map_elements/map_groups/map_rows is complicated enough that it warrants completely precise, example-rich documentation. (Personally, I feel the API behind the polars functions itself might use some improvement as well, but that's another ticket, and not one that I as a novice should make claims about.)
Below, I list some specific gripes with the current documentation of the individual functions.
However, I think beyond fixing these, it'd be useful to have some joint documentation of the differences and use-cases of each of the various UDF functions. For example, I really like numpy's documentation of (generalized) ufuncs at https://numpy.org/doc/stable/reference/c-api/generalized-ufuncs.html in comparison.
map_elements
Documentation for map_elements
Confusingly, map_elements has a name that very clearly suggests that the function here is applied pointwise, or at least should operate in a pointwise fashion, yet the documentation indicates that it can be used to emulate grouped folds, such as group_by().cum_sum(), given that according to the documentation:
I think it should be more clearly highlighted that map_elements is allowed in a non-pointwise fashion in group-by contexts (or that it is not, in case this is an implementation detail that may change).
map_groups
Documentation for map_groups
map_elements can also apply a UDF in a GroupBy context as per its own documentation
"function" parameter except that it contradicts what's in the signature, i.e., that it is fed a Sequence of Series.
map_batches
Documentation for map_batches
This is the worst.
which is confusing because it sounds like the same thing that map_groups and map_elements do in a group-by context, and the author says at https://github.com/pola-rs/polars/issues/12941 that
which seems to contradict the claim that the function is applied to "whole Series". Being new to polars, I don't fully understand what this means (that's not a complaint, issues don't have to use language for beginners), but confusingly in another place he seems to contradict (but this may just be my misunderstanding) even that, saying that map_batches doesn't even get fed complete "batches":
This is not guaranteed. Expr.map may get an incomplete batch. This happens for instance in the streaming engine.
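Why an incomplete batch matters can be shown with a small pure-Python sketch. This is only an illustration of the general point, not a description of how polars splits data: apply_in_batches is a hypothetical helper and the batch size is arbitrary. An elementwise function gives the same answer however the input is split, while a function that depends on the whole series (here a cumulative sum) does not.

```python
from itertools import accumulate


def apply_in_batches(values, f, batch_size):
    """Hypothetical helper: call f on consecutive slices and concatenate."""
    out = []
    for start in range(0, len(values), batch_size):
        out.extend(f(values[start:start + batch_size]))
    return out


def double(xs):
    """Elementwise: each output depends only on its own input."""
    return [2 * x for x in xs]


def cumulative_sum(xs):
    """Not elementwise: each output depends on all earlier inputs."""
    return list(accumulate(xs))


values = [1, 2, 3, 4, 5]

double_whole = apply_in_batches(values, double, batch_size=len(values))
double_split = apply_in_batches(values, double, batch_size=2)

cumsum_whole = apply_in_batches(values, cumulative_sum, batch_size=len(values))
cumsum_split = apply_in_batches(values, cumulative_sum, batch_size=2)
```

Here double_whole and double_split agree, whereas cumsum_split restarts at every batch boundary and so differs from cumsum_whole. That is the practical reason the documentation needs to say whether a function is guaranteed to see the whole series.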
Warning
If you are looking to map a function over a window function or group_by context, refer to map_elements() instead.
Aggregate the values of the expression into a list before applying the function. This parameter only works in a group-by context.