pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Suggestion to update the UDF documentation #12912

Open Chuck321123 opened 9 months ago

Chuck321123 commented 9 months ago

Description

Reading the documentation, I'm still struggling with how I would implement longer Python UDF functions in Polars. It would be nice if one or several examples could be added. Also, in pandas I used transform a lot compared to apply and map, as you could do vectorized UDF operations similar to this: df.groupby("Group")["Relevant_Column"].transform(My_UDF_function). How would this work in Polars if the UDF functions become slightly advanced? It would be nice to update the documentation a bit and see more examples of this.

Link

https://pola-rs.github.io/polars/user-guide/expressions/user-defined-functions/


MarcoGorelli commented 9 months ago

could you show an example of a udf you'd like to use?

Chuck321123 commented 9 months ago

@MarcoGorelli Unfortunately, I don't have any right now. I do, however, have a simpler UDF that revolves around a third-party package. Let's take this fictive example:

import Thirdpartypackage as tp

df["New_Col"] = df.groupby("Group")["Relevant_Column"].transform(lambda x: tp.tp_attribute(necessary_variable=x, optional_variable=2))

It would be nice to extend the documentation around how one would implement stuff like this.

MarcoGorelli commented 9 months ago

thanks - have you looked at https://pola-rs.github.io/polars/user-guide/migration/pandas/#pandas-transform ? does this answer your question?

(btw, note that if you're running a third party package with a lambda, then pandas transform won't be vectorised either)
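For reference, a minimal sketch of the pattern that guide describes (my illustration, not from the thread): the pandas groupby(...).transform(...) idiom maps to a window expression with .over() in Polars.

import polars as pl

df = pl.DataFrame({"Group": ["A", "A", "B"], "Values": [1.0, 2.0, 3.0]})

# pandas: df.groupby("Group")["Values"].transform("mean")
df = df.with_columns(group_mean=pl.col("Values").mean().over("Group"))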

deanm0000 commented 9 months ago

One thing to note is how powerful numba's guvectorize is for writing UDFs, since it compiles them into vectorized (generalized) ufuncs.
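For illustration, a minimal sketch of that approach (the moving_average function and its parameters are my own, not from this thread):

import numpy as np
import polars as pl
from numba import guvectorize, float64, int64

@guvectorize([(float64[:], int64, float64[:])], "(n),()->(n)")
def moving_average(values, window, out):
    # Trailing mean; positions without a full window get NaN.
    for i in range(len(values)):
        if i + 1 < window:
            out[i] = np.nan
        else:
            out[i] = values[i - window + 1 : i + 1].mean()

df = pl.DataFrame({"Values": [1.0, 2.0, 3.0, 4.0]})
df = df.with_columns(
    MA2=pl.col("Values").map_batches(lambda s: pl.Series(moving_average(s.to_numpy(), 2)))
)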

Chuck321123 commented 9 months ago

@MarcoGorelli You are right. Yes, I looked at it and understand that Polars wants us to use native functions instead, although that's not always possible. I tried this:

df = df.with_columns(pl.col("Relevant_Column").map_batches(lambda x: tp.tp_attribute(necessary_variable=x, optional_variable=2)).over("Group").alias("New_Col"))

but with no luck (map and apply have been renamed to map_batches and map_elements).

MarcoGorelli commented 9 months ago

thanks - what do you mean by 'no luck'? could you make a reproducible example please

deanm0000 commented 9 months ago

You often need to wrap the custom function's output in pl.Series, like this:

df = df.with_columns(pl.col("Relevant_Column").map_batches(lambda x: pl.Series(tp.tp_attribute(necessary_variable=x, optional_variable=2))).over("Group").alias("New_Col"))

zaa730Sight commented 9 months ago

One thing to note is how powerful numba's guvectorize is for writing UDFs, since it compiles them into vectorized (generalized) ufuncs.

Strongly agree with this. That being said, for heavy usage the syntax gets kind of clunky. For users who know what they're doing, however, I suspect the Polars team would rather steer them towards writing native Rust kernels with the pyo3-polars extension.

But all in all, I find it hard to beat the rapid prototyping and performance, all from a single Python environment, that Numba guvectorize + Polars (including LazyFrame evaluation) can get you. There are some rough edges with this, though.

Starting with the next Numba release, we should get this: https://github.com/numba/numba/pull/9058. It would be ideal to get a multi-dimensional (MxN) output from a single call. Perhaps Polars would be able to wrap that in a Struct/List/Array of values as a single column (currently it can natively get you a pl.Series of a compatible type back).

Also, looking forward to the future, it would be interesting to see if Polars will play nice with Mojo kernels/code, as (assuming Mojo takes off) I wager that would be a decent overlap among the user bases.

Chuck321123 commented 9 months ago

@deanm0000 I tried, and unfortunately, it didn't work. @MarcoGorelli, here is a simple reproducible example:

import pandas as pd
import numpy as np
import pandas_ta as ta
import polars as pl
np.random.seed(42)
df = pd.DataFrame({
    'Group': np.random.choice(['A', 'B', 'C'], size=50),
    'Values': np.random.rand(50) 
})
df = df.sort_values(by='Group')
df["Moving_Average"] = (df.groupby('Group')["Values"].transform(lambda x: ta.sma(close=x, length=2))).bfill()
# Alternative:
def My_UDF_Function(x):
    return ta.sma(close=x, length=2)
df["MA2"] = df.groupby('Group')["Values"].transform(My_UDF_Function).bfill()

Keep in mind that you need to pip install pandas-ta first. I already know this can be done in Polars using the rolling functions, but I thought I'd leave an example here.
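For reference, a minimal sketch of that native-Polars version (my own, reusing the df built in the previous block):

dfpl = pl.from_pandas(df)
dfpl = dfpl.with_columns(
    Moving_Average=pl.col("Values")
    .rolling_mean(window_size=2)   # native equivalent of ta.sma(length=2)
    .over("Group")
    .fill_null(strategy="backward")
)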

cmdlineluser commented 9 months ago

.map_batches() is not "group aware", it needs to be .map_elements()

It is mentioned in the Notes section of the docs, but perhaps it could be made clearer.

If you are looking to map a function over a window function or group_by context, refer to func:map_elements instead.

(Looks like there is also some formatting issue with the rst file?)

MarcoGorelli commented 9 months ago

looks to me like pandas-ta specifically wants a pandas series - not a polars series, and not a numpy array, it won't work with anything else:

In [42]: ta.sma(dfpl['Values'], length=2)

In [43]: ta.sma(dfpl['Values'].to_numpy(), length=2)

In [44]: ta.sma(pd.Series(dfpl['Values']), length=2)
Out[44]:
0          NaN
1     0.795927
2     0.643506
3     0.690988
4     0.525113
5     0.267241
6     0.262692
7     0.494509
8     0.680310
9     0.742702
10    0.762478
11    0.646630
12    0.389337
13    0.055688
14    0.333522
15    0.333357
16    0.090452
17    0.220633
18    0.362725
19    0.410134
20    0.665995
21    0.678027
22    0.531861
23    0.698057
24    0.826965
25    0.660905
26    0.422452
27    0.240534
28    0.102339
29    0.347099
30    0.375864
31    0.235960
32    0.545532
33    0.841617
34    0.879748
35    0.766130
36    0.498510
37    0.551201
38    0.502415
39    0.153784
40    0.384721
41    0.589381
42    0.424279
43    0.496428
44    0.431987
45    0.451654
46    0.786145
47    0.776003
48    0.384494
49    0.283826
Name: SMA_2, dtype: float64

so, barring converting to pandas (or requesting that they support Polars), I'm not sure there's much that can be done here
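For reference, a minimal sketch of that pandas round-trip (essentially the approach worked out further down the thread):

import pandas_ta as ta
import polars as pl

def sma_udf(s: pl.Series) -> pl.Series:
    # Convert to pandas for pandas-ta, then wrap the result back into a Polars Series.
    return pl.Series(ta.sma(s.to_pandas(), length=2))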

deanm0000 commented 9 months ago

Is this similar enough?

https://stackoverflow.com/questions/77160103/exponential-moving-average-ema-calculations-in-polars-dataframe

mkleinbort-ic commented 9 months ago

For reference, here is a good example of nb.guvectorize

https://stackoverflow.com/questions/77523657/how-do-you-condense-long-recursive-polars-expressions/77525171?noredirect=1#comment136683582_77525171

But going back to your question - have you tried just using .map_elements on the list of values itself?

Polars code:

df = pl.DataFrame({
    'Group': np.random.choice(['A', 'B', 'C'], size=50),
    'Values': np.random.rand(50) 
})

df1 = df.group_by('Group').agg(pl.col('Values'))
# This is what we have so far:

# shape: (3, 2)
# ┌───────┬──────────────────────────────────┐
# │ Group ┆ Values                           │
# │ ---   ┆ ---                              │
# │ str   ┆ list[f64]                        │
# ╞═══════╪══════════════════════════════════╡
# │ C     ┆ [0.035229, 0.925109, … 0.297392] │
# │ A     ┆ [0.789259, 0.004903, … 0.56967]  │
# │ B     ┆ [0.917179, 0.396942, … 0.962115] │
# └───────┴──────────────────────────────────┘

df2 = df1.with_columns(**{
    'Moving Average': pl.col('Values').map_elements(lambda x: ....) # Use your UDF that takes in list of values here
})

# Then an explode to have it in the original format.
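(For completeness, a rough sketch of that explode step, with a stand-in doubling UDF since, as discussed below, pandas-ta won't accept a list:)

df2 = df1.with_columns(
    pl.col('Values').map_elements(lambda x: [v * 2 for v in x]).alias('Moving Average')  # stand-in UDF
)
df3 = df2.explode('Values', 'Moving Average')  # back to one row per original value
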
MarcoGorelli commented 9 months ago

that won't work I'm afraid

In [14]: df2 = df1.with_columns(**{
    ...:     'Moving Average': pl.col('Values').map_elements(lambda x: ta.sma(x, length=2))  # Use your UDF that takes in a list of values here
    ...: })

In [15]: df2
Out[15]:
shape: (3, 3)
┌───────┬──────────────────────────────────┬────────────────┐
│ Group ┆ Values                           ┆ Moving Average │
│ ---   ┆ ---                              ┆ ---            │
│ str   ┆ list[f64]                        ┆ list[null]     │
╞═══════╪══════════════════════════════════╪════════════════╡
│ B     ┆ [0.015636, 0.165267, … 0.520834] ┆ null           │
│ C     ┆ [0.230894, 0.241025, … 0.385417] ┆ null           │
│ A     ┆ [0.844534, 0.74732, … 0.651077]  ┆ null           │
└───────┴──────────────────────────────────┴────────────────┘

The issue, as far as I can tell, is that pandas-ta only accepts pandas series, and nothing else (not even numpy array)

I'd suggest opening a feature request to them to accept numpy arrays (or, even better, Polars)

mkleinbort-ic commented 9 months ago

The issue, as far as I can tell, is that pandas-ta only accepts pandas series, and nothing else (not even numpy array)

Can't that be fixed with code like:

def pandas_ta_wrapper(x: list) -> pd.Series:
    return pd.Series(x)

df2 = df1.with_columns(**{
    'Moving Average': pl.col('Values').map_elements(lambda x: ta.sma(pandas_ta_wrapper(x), length=2))
})
MarcoGorelli commented 9 months ago

You'd need to then convert back to Polars

This works:

(
    dfpl.with_columns(
        MA2=pl.col("Values")
        .map_batches(lambda x: pl.Series(ta.sma(x.to_pandas(), length=2)))
        .fill_null(strategy="backward")
        .over("Group")
    )
)

EDIT: This comment is not correct; it requires map_elements, not map_batches.

Chuck321123 commented 9 months ago

Thanks for the input. Yeah, I didn't realize the package explicitly needed a pandas series to work, @MarcoGorelli. However, I tried your solution and it worked, and surprisingly with ~30% better performance on larger datasets.

cmdlineluser commented 9 months ago

The last example uses .map_batches(), so I don't think it's equivalent.

(It's not generating per-group values.)

MarcoGorelli commented 9 months ago

it's using map_batches, followed by over, so it should be equivalent. ~at least, if I try it, I get the same result:~ I may have missed boundary details, just checking.

hmm, this doesn't behave like I would have expected it to - I've opened https://github.com/pola-rs/polars/issues/12941 about this:

In [11]: df = pl.DataFrame({'a': [1,1, 2], 'b': [4,5,6]})

In [12]: df.with_columns(pl.col('b').map_batches(lambda x: x.shift()).over('a'))
Out[12]:
shape: (3, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 1   ┆ null │
│ 1   ┆ 4    │
│ 2   ┆ 5    │
└─────┴──────┘
cmdlineluser commented 9 months ago

Yeah, I mentioned it earlier: https://github.com/pola-rs/polars/issues/12912#issuecomment-1843680282

map_batches always passes the full column, it doesn't do groups.

If you are looking to map a function over a window function or group_by context, refer to func:map_elements instead.

(The distinction was perhaps clearer with the previous names of map vs. apply because one could think of a "group" as a "batch")

import pandas_ta as ta
import polars as pl

df = pl.DataFrame({
   "Group":  ["A", "A", "B", "C", "C", "C"],
   "Values": [1, 2, 3, 4, 5, 6]
})
df.with_columns(
   MA2=pl.col("Values")
   .map_batches(lambda x: [print(x), pl.Series(ta.sma(x.to_pandas(), length=2))][1])
   .over("Group")
)
# shape: (6,)
# Series: 'Values' [i64]
# [
#   1
#   2
#   3
#   4
#   5
#   6
# ]
# shape: (6, 3)
# ┌───────┬────────┬──────┐
# │ Group ┆ Values ┆ MA2  │
# │ ---   ┆ ---    ┆ ---  │
# │ str   ┆ i64    ┆ f64  │
# ╞═══════╪════════╪══════╡
# │ A     ┆ 1      ┆ null │
# │ A     ┆ 2      ┆ 1.5  │
# │ B     ┆ 3      ┆ 2.5  │
# │ C     ┆ 4      ┆ 3.5  │
# │ C     ┆ 5      ┆ 4.5  │
# │ C     ┆ 6      ┆ 5.5  │
# └───────┴────────┴──────┘
df.with_columns(
   MA2=pl.col("Values")
   .map_elements(lambda x: [print(x), pl.Series(ta.sma(x.to_pandas(), length=2))][1])
   .over("Group")
)
# shape: (2,)
# Series: '' [i64]
# [
#   1
#   2
# ]
# shape: (1,)
# Series: '' [i64]
# [
#   3
# ]
# shape: (3,)
# Series: '' [i64]
# [
#   4
#   5
#   6
# ]
# shape: (6, 3)
# ┌───────┬────────┬──────┐
# │ Group ┆ Values ┆ MA2  │
# │ ---   ┆ ---    ┆ ---  │
# │ str   ┆ i64    ┆ f64  │
# ╞═══════╪════════╪══════╡
# │ A     ┆ 1      ┆ null │
# │ A     ┆ 2      ┆ 1.5  │
# │ B     ┆ 3      ┆ null │
# │ C     ┆ 4      ┆ null │
# │ C     ┆ 5      ┆ 4.5  │
# │ C     ┆ 6      ┆ 5.5  │
# └───────┴────────┴──────┘
deanm0000 commented 8 months ago

I petitioned for this one to be reopened. It really seems like map_batches ought to "turn into" map_elements when it's inside an over/agg.

MarcoGorelli commented 8 months ago

Thanks all for your comments!

Right, so going back to the original example, the solution is indeed to use map_elements (not map_batches!) followed by over:

import pandas as pd
import numpy as np
import pandas_ta as ta
import polars as pl
from polars.testing import assert_series_equal
np.random.seed(42)
df = pd.DataFrame({
    'Group': np.random.choice(['A', 'B', 'C'], size=50),
    'Values': np.random.rand(50)
})
df = df.sort_values(by='Group')
df["Moving_Average"] = (df.groupby('Group')["Values"].transform(lambda x: ta.sma(close=x, length=2))).bfill()
# Alternative:
def My_UDF_Function(x):
    return ta.sma(close=x, length=2)

dfpl = pl.from_pandas(df)

dfpl = (
    dfpl.with_columns(
        MA2=pl.col("Values")
        .map_elements(lambda x: pl.Series(ta.sma(x.to_pandas(), length=2)))
        .fill_null(strategy="backward")
        .over("Group")
    )
)

df["MA2"] = df.groupby('Group')["Values"].transform(My_UDF_Function).bfill()

dfpl = dfpl.with_columns(
    pandas_MA2=pl.from_pandas(df['MA2'])
)
assert_series_equal(dfpl['MA2'], dfpl['pandas_MA2'], check_names=False)

I'll try to update the docs

Mickychen00 commented 8 months ago

The user guide's UDF page still needs to be updated. The current version talks about map and apply, two outdated functions.
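For anyone landing here in the meantime, a minimal summary sketch of the renaming (my own, for the Polars versions discussed in this thread):

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})
df = df.with_columns(
    doubled=pl.col("a").map_batches(lambda s: s * 2),  # formerly Expr.map: receives the whole Series (even under .over(), per the discussion above)
    plus_one=pl.col("a").map_elements(lambda x: x + 1, return_dtype=pl.Int64),  # formerly Expr.apply: element-wise, or per group under .over()/.agg()
)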