pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.29k stars 1.96k forks source link

Documentation for map_batches / map_elements / map_groups / map_rows is confusing #14521

Open soerenwolfers opened 8 months ago

soerenwolfers commented 8 months ago

Description

Summary

The functionality split between map_batches / map_elements / map_groups / map_rows is complicated enough that it warrants having completely precise and rich-of-examples documentation. (Personally, I feel the API behind the polars functions itself might use some improvement as well, but that's another ticket, and not one that I as a novice should make claims about.)

Below, I list some specific gripes with the current documentation of the individual functions.

However, I think beyond fixing these, it'd be useful to have some joint documentation of the differences and use-cases of each of the various UDF functions. For example, I really like numpy's documentation of (generalized) ufuncs at https://numpy.org/doc/stable/reference/c-api/generalized-ufuncs.html in comparison.

map_elements

Documentation for map_elements

Confusingly, map_elements has a name that very clearly suggests that the function here is applied pointwise, or at least should operate in a pointwise fashion, yet the documentation indicates that it can be used to emulate grouped folds, such as group_by().cum_sum(), given that according to the documentation:

GroupBy
Expects function to be of type Callable[[Series], Any]. For each group, applies a Python function to the slice of the column corresponding to that group.

I think it should be more clearly highlighted that map_elements is allowed in a non-pointwise fashion in group-by contexts (or that it is not, in case this is an implementation detail that may change)

map_groups

Documentation for map_groups

map_batches

Documentation for map_batches

This is the worst.

This is not guaranteed. Expr.map may get an incomplete batch. This happens for instance in the streaming engine.


- Next, the documentation says 

Warning

If you are looking to map a function over a window function or group_by context, refer to map_elements() instead.

yet the description of the `agg_list` parameter says

Aggregate the values of the expression into a list before applying the function. This parameter only works in a group-by context


and half the documentation is spent discussing how to use that parameter in a group by context.

- Finally, I have absolutely no idea what contract `map_batches` imposes on its `function` arguments, and whether that contract depends on the context in which it is used. That seems much more important to know than implementation details about what the function is applied to in the end. Same goes for all three really.

### Link

_No response_
braaannigan commented 8 months ago

Good points @soerenwolfers - I'd be happy to help if you want to make a PR

soerenwolfers commented 8 months ago

I would need to be certain about what's going first, and I think beyond playing with the functions to see what's currently possible I would also need some guidance on what the intentions are.

For example, I don't currently understand why there'd be more than

(To avoid having to pack multiple columns into a single struct column, both of these could also be allowed to be applied on entire dataframes (e.g., df.map_rows(f, vectorized=False) or df.map_rows(f, vectorized=True) or df.group_by('b').map_groups(f) or df.map_groups(f).over('b')), but there is a choice to be made then whether to supply dataframes or series-of-row-type in that case, the best decision for which which will depend on polars philosophy/implementation details.)

In particular, I would only have two names instead of four.

braaannigan commented 8 months ago

I can't speak for the devs, but I don't think there's any drive to change the names or API, I think improving the documentation would be enough

soerenwolfers commented 8 months ago

Yeah I get that, but I for one cannot improve the documentation because I don't know what's going on.

cmdlineluser commented 8 months ago

There is also the top-level pl.map_groups (and map_batches, but no "elements")

.map_elements changes behaviour depending on the context i.e. it is "map_rows" in select context but inside a groupby context it is the same as .map_groups (without the outer [])?

import json
import polars as pl

df = pl.DataFrame(dict(
    group=[1, 2, 3, 1, 2], 
    value=[1, 2, 3, 4, 5]
))

df.with_columns(
    elements = pl.col("value").map_elements(lambda x: json.dumps(list(x))).over("group"),
    groups   = pl.map_groups("value", lambda x: json.dumps(list(x[0]))).over("group")
)
# shape: (5, 4)
# ┌───────┬───────┬──────────┬────────┐
# │ group ┆ value ┆ elements ┆ groups │
# │ ---   ┆ ---   ┆ ---      ┆ ---    │
# │ i64   ┆ i64   ┆ str      ┆ str    │
# ╞═══════╪═══════╪══════════╪════════╡
# │ 1     ┆ 1     ┆ [1, 4]   ┆ [1, 4] │
# │ 2     ┆ 2     ┆ [2, 5]   ┆ [2, 5] │
# │ 3     ┆ 3     ┆ [3]      ┆ [3]    │
# │ 1     ┆ 4     ┆ [1, 4]   ┆ [1, 4] │
# │ 2     ┆ 5     ┆ [2, 5]   ┆ [2, 5] │
# └───────┴───────┴──────────┴────────┘

Should there not just be Expr.map_rows / Expr.map_groups and disallow .map_rows in a groupby context?

vadimcn commented 4 months ago

On top of this, there seem to be 3 flavors of map_batches, all slightly different:

Why LazyFrame.map_batches takes a DataFrame, while others take a Series? Nevertheless, it seems that the DataFrame that the function is called with, always contains only one column?

Why isn't there streamable of is_elementwise parameter for polars.map_batches?

According to documentation, polars.map_batches may be applied to multiple expr's at once, but will the function be called with each series separately, or all at once (as multiple arguments)?

Also, why isn't there DataFrame.map_batches?

I would concur with the OP: map_... APIs are in a serious need of rationalization.