Documentation for map_batches / map_elements / map_groups / map_rows is confusing

soerenwolfers commented 8 months ago

Description

Summary

The functionality split between map_batches / map_elements / map_groups / map_rows is complicated enough that it warrants having completely precise and rich-of-examples documentation. (Personally, I feel the API behind the polars functions itself might use some improvement as well, but that's another ticket, and not one that I as a novice should make claims about.)

Below, I list some specific gripes with the current documentation of the individual functions.

However, I think beyond fixing these, it'd be useful to have some joint documentation of the differences and use-cases of each of the various UDF functions. For example, I really like numpy's documentation of (generalized) ufuncs at https://numpy.org/doc/stable/reference/c-api/generalized-ufuncs.html in comparison.

map_elements

Documentation for map_elements

Confusingly, map_elements has a name that very clearly suggests that the function here is applied pointwise, or at least should operate in a pointwise fashion, yet the documentation indicates that it can be used to emulate grouped folds, such as group_by().cum_sum(), given that according to the documentation:

GroupBy
Expects function to be of type Callable[[Series], Any]. For each group, applies a Python function to the slice of the column corresponding to that group.

I think it should be more clearly highlighted that map_elements is allowed in a non-pointwise fashion in group-by contexts (or that it is not, in case this is an implementation detail that may change)

map_groups

Documentation for map_groups

The description says nothing but "Apply a custom/user-defined function (UDF) in a GroupBy context." which leaves the question why this function exists, given that map_elements can also apply a UDF in a GroupBy context as per it's own documentation
The documentation says nothing about the "function" parameter except that it contradicts what's in the signature, i.e., that it is fed a Sequence of Series.

map_batches

Documentation for map_batches

This is the worst.

First, nowhere in the documentation is it even hinted at what a "batch" is. Instead, the documentation starts by saying
```
Apply a custom python function to a whole Series or sequence of Series.
```
which is confusing because it sounds like the same that map_groups and map_elements do in a group by context, and the author says at https://github.com/pola-rs/polars/issues/12941 that
```
A batch by definition doesn't include a whole group. E.g. if you take a batch from a data set and you sort that batch, you shouldn't assume the whole dataset is sorted.
```
which seems to contradict the claim that the function is applied to "whole Series". Being new to polars, I don't fully understand what this means (that's not a complaint, issues don't have to use language for beginners), but confusingly in another place he seems to contradict (but this may just be my misunderstanding) even that, saying that map_batches doesn't even get fed complete "batches":
```
> Methods are applied to the object they are called on. So if I call map on a Series, I expect to map the whole Series.
```

This is not guaranteed. Expr.map may get an incomplete batch. This happens for instance in the streaming engine.


- Next, the documentation says

Warning

If you are looking to map a function over a window function or group_by context, refer to map_elements() instead.

yet the description of the `agg_list` parameter says

Aggregate the values of the expression into a list before applying the function. This parameter only works in a group-by context


and half the documentation is spent discussing how to use that parameter in a group by context.

- Finally, I have absolutely no idea what contract `map_batches` imposes on its `function` arguments, and whether that contract depends on the context in which it is used. That seems much more important to know than implementation details about what the function is applied to in the end. Same goes for all three really.

### Link

_No response_

braaannigan commented 8 months ago

Good points @soerenwolfers - I'd be happy to help if you want to make a PR

soerenwolfers commented 8 months ago

I would need to be certain about what's going first, and I think beyond playing with the functions to see what's currently possible I would also need some guidance on what the intentions are.

For example, I don't currently understand why there'd be more than

On the one hand, "elementwise mappings", the functions supplied to which would have to operate element-/row-wise but are allowed to set a vectorized parameter that determines whether the supplied function is () -> () (accepts scalar, returns scalar) or (n) -> (n) (accepts series, returns series). That's what I would call map_rows. On expressions, the inputs would either be individual values of a single series (e.g. pl.col('a').map_rows(f, vectorized=False)) or arbitrarily sized (decided by implementation details) chunks of series (e.g. pl.col('a').map_rows(f, vectorized=True)).
On the other hand, "groupwise mappings", the functions supplied to which would not need to satisfy any contract except needing to be (n)->(n)(accepts series, returns series) in a window context and (n)->() (accepts series, returns scalar) in a groupby context. That's what I would call map_groups. On expressions (e.g., pol.col('a').map_groups(f).over('b') or df.group_by('b').agg(pl.col('a').map_groups(f))) the inputs would be chunks (corresponding to the entire respective window partition or group, or the entire series if applied without context) of a series.

(To avoid having to pack multiple columns into a single struct column, both of these could also be allowed to be applied on entire dataframes (e.g., df.map_rows(f, vectorized=False) or df.map_rows(f, vectorized=True) or df.group_by('b').map_groups(f) or df.map_groups(f).over('b')), but there is a choice to be made then whether to supply dataframes or series-of-row-type in that case, the best decision for which which will depend on polars philosophy/implementation details.)

In particular, I would only have two names instead of four.

braaannigan commented 8 months ago

I can't speak for the devs, but I don't think there's any drive to change the names or API, I think improving the documentation would be enough

soerenwolfers commented 8 months ago

Yeah I get that, but I for one cannot improve the documentation because I don't know what's going on.

cmdlineluser commented 8 months ago

There is also the top-level pl.map_groups (and map_batches, but no "elements")

.map_elements changes behaviour depending on the context i.e. it is "map_rows" in select context but inside a groupby context it is the same as .map_groups (without the outer [])?

import json
import polars as pl

df = pl.DataFrame(dict(
    group=[1, 2, 3, 1, 2], 
    value=[1, 2, 3, 4, 5]
))

df.with_columns(
    elements = pl.col("value").map_elements(lambda x: json.dumps(list(x))).over("group"),
    groups   = pl.map_groups("value", lambda x: json.dumps(list(x[0]))).over("group")
)

# shape: (5, 4)
# ┌───────┬───────┬──────────┬────────┐
# │ group ┆ value ┆ elements ┆ groups │
# │ ---   ┆ ---   ┆ ---      ┆ ---    │
# │ i64   ┆ i64   ┆ str      ┆ str    │
# ╞═══════╪═══════╪══════════╪════════╡
# │ 1     ┆ 1     ┆ [1, 4]   ┆ [1, 4] │
# │ 2     ┆ 2     ┆ [2, 5]   ┆ [2, 5] │
# │ 3     ┆ 3     ┆ [3]      ┆ [3]    │
# │ 1     ┆ 4     ┆ [1, 4]   ┆ [1, 4] │
# │ 2     ┆ 5     ┆ [2, 5]   ┆ [2, 5] │
# └───────┴───────┴──────────┴────────┘

Should there not just be Expr.map_rows / Expr.map_groups and disallow .map_rows in a groupby context?

vadimcn commented 4 months ago

On top of this, there seem to be 3 flavors of map_batches, all slightly different:

Why LazyFrame.map_batches takes a DataFrame, while others take a Series? Nevertheless, it seems that the DataFrame that the function is called with, always contains only one column?

Why isn't there streamable of is_elementwise parameter for polars.map_batches?

According to documentation, polars.map_batches may be applied to multiple expr's at once, but will the function be called with each series separately, or all at once (as multiple arguments)?

Also, why isn't there DataFrame.map_batches?

I would concur with the OP: map_... APIs are in a serious need of rationalization.

pola-rs / polars