otsaloma / dataiter

Python classes for data manipulation
https://dataiter.readthedocs.io/
MIT License

Shorten aggregate notation #9

Closed otsaloma closed 2 years ago

otsaloma commented 2 years ago

Currently we do

(data
 .group_by("year", "month")
 .aggregate(
     sales_total=lambda x: x.sales.sum(),
     sales_per_day=lambda x: x.sales.mean(),
 ))

With a lot of calculated columns, that gets a bit verbose with all the lambdas.

Maybe we could add helpers to shorten the lambdas in common cases, e.g.

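# Proposed helpers, to be used as di.mean and di.sum.
def mean(name):
    # Return a function that computes the mean of column `name` in a group.
    return lambda x: x[name].mean()

def sum(name):
    # Note: inside the module, this name shadows the builtin sum.
    return lambda x: x[name].sum()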

(data
 .group_by("year", "month")
 .aggregate(
     sales_total=di.sum("sales"),
     sales_per_day=di.mean("sales"),
 ))
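
If more helpers like these are wanted, they could perhaps be generated from method names instead of being written one by one. A minimal sketch, assuming a group can be indexed by column name and the column has the named method; Column and make_aggregator are hypothetical names for illustration, not part of dataiter:

```python
from statistics import fmean

class Column(list):
    """Hypothetical stand-in for a column; only sum and mean are provided."""
    def sum(self):
        # Inside a method body, `sum` resolves to the builtin.
        return sum(self)
    def mean(self):
        return fmean(self)

def make_aggregator(method):
    """Return a factory: factory(name) -> function(group) -> scalar."""
    def factory(name):
        def aggregate(group):
            # Equivalent to lambda x: x[name].<method>()
            return getattr(group[name], method)()
        aggregate.__name__ = f"{method}_{name}"
        return aggregate
    return factory

# Hypothetical module-level helpers, analogous to di.sum and di.mean.
agg_sum = make_aggregator("sum")
agg_mean = make_aggregator("mean")

group = {"sales": Column([10, 20, 30])}
total = agg_sum("sales")(group)
per_day = agg_mean("sales")(group)
```

Named inner functions instead of lambdas would also give nicer tracebacks, at the cost of a few more lines.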

Or we could use a single lambda that returns a complex value, similar to Pandas' apply. That looks nice with a lot of columns, but really bad when only one column is needed, such as .aggregate(n=di.nrow) in the current notation.

(data
 .group_by("year", "month")
 .aggregate(lambda x: {
     "sales_total": x.sales.sum(),
     "sales_per_day": x.sales.mean(),
 }))
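
The two notations would not have to be mutually exclusive. A rough sketch of how one entry point might accept either form, assuming groups are plain dicts of lists; aggregate_groups is a hypothetical stand-in for illustration, not how DataFrame.aggregate is actually implemented:

```python
def aggregate_groups(groups, function=None, **colfunctions):
    """Return one dict of results per group (hypothetical helper).

    Accepts either a single function returning a dict of results,
    or keyword arguments mapping result names to functions.
    """
    if function is not None and colfunctions:
        raise ValueError("Pass either one function or keyword functions, not both")
    rows = []
    for group in groups:
        if function is not None:
            # Single-lambda notation: the function names the columns itself.
            rows.append(dict(function(group)))
        else:
            # Keyword notation: one function per result column.
            rows.append({name: f(group) for name, f in colfunctions.items()})
    return rows

groups = [{"sales": [1, 2, 3]}, {"sales": [4, 5]}]

by_dict = aggregate_groups(groups, lambda x: {
    "sales_total": sum(x["sales"]),
    "sales_max": max(x["sales"]),
})
by_keyword = aggregate_groups(groups, n=lambda x: len(x["sales"]))
```

This keeps the short keyword form for the one-column case while allowing the dict form when there are many columns.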
otsaloma commented 2 years ago

Having function factories for common operations could also allow a speedup by using Numba under the hood. If DataFrame.aggregate recognizes these special functions, it could make a single call instead of the current [function(x) for x in slices], thus placing the loop over the groups in the Numba code.

https://numba.pydata.org/
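
The dispatch idea could look roughly like the sketch below. It is pure Python and does not use Numba; grouped_reduce only marks where a compiled loop would go. ColumnAggregate, grouped_reduce and aggregate_column are hypothetical names, and the sketch assumes integer group ids in range(ngroups) with no empty groups:

```python
class ColumnAggregate:
    """Hypothetical marker class for factory-made aggregations."""
    def __init__(self, name, kind):
        self.name = name  # column to aggregate
        self.kind = kind  # "sum" or "mean"
    def __call__(self, group):
        # Still usable as an ordinary per-group function.
        values = group[self.name]
        total = sum(values)
        return total if self.kind == "sum" else total / len(values)

def grouped_reduce(values, group_ids, ngroups, kind):
    """One pass over all rows; the kind of loop that could be compiled."""
    totals = [0.0] * ngroups
    counts = [0] * ngroups
    for value, gid in zip(values, group_ids):
        totals[gid] += value
        counts[gid] += 1
    if kind == "sum":
        return totals
    return [t / c for t, c in zip(totals, counts)]

def aggregate_column(function, columns, group_ids, ngroups):
    if isinstance(function, ColumnAggregate):
        # Recognized special function: a single call for all groups.
        return grouped_reduce(columns[function.name], group_ids, ngroups, function.kind)
    # Fallback: build one slice per group and call the function on each.
    slices = [
        {name: [v for v, g in zip(vals, group_ids) if g == gid]
         for name, vals in columns.items()}
        for gid in range(ngroups)
    ]
    return [function(x) for x in slices]

columns = {"sales": [1.0, 2.0, 3.0, 4.0]}
group_ids = [0, 0, 1, 1]
fast_sum = aggregate_column(ColumnAggregate("sales", "sum"), columns, group_ids, 2)
fast_mean = aggregate_column(ColumnAggregate("sales", "mean"), columns, group_ids, 2)
slow_sum = aggregate_column(lambda x: sum(x["sales"]), columns, group_ids, 2)
```

Arbitrary lambdas keep working through the fallback, so only the recognized helpers would need a fast implementation.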