Expand grouping beyond sum and mean

Only sum and mean are currently benchmarked, albeit over a reasonable range of cardinalities and data types. Many more candidates quickly spring to mind, so it's more a question of balancing resources against potential insight. For starters, we could consider the following.

1. Add median. It is computed very differently from sum and mean (it needs ordering or selection within each group rather than a running total) and is also very common in data science.
2. Add a very simple user-defined function (see #57 too); e.g. DT[, max(colB) - min(colB), by=colA] vs DT[, diff(range(colB)), by=colA]. Or something like DT[, 100*mean(colB), by=colA].
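
As a rough illustration of the two items above, here is a minimal sketch on a toy table. The data and the result names (med, spread1, spread2, pct) are made up for this example and are not part of the benchmark. Note that range() returns both the minimum and the maximum, so the sketch wraps it in diff() to get the same single value as max(colB) - min(colB).

```r
library(data.table)

# Hypothetical toy data: colA is the grouping key, colB the value column.
DT <- data.table(colA = c("a", "a", "a", "b", "b"),
                 colB = c(1, 5, 3, 10, 4))

# Median by group.
med <- DT[, .(med = median(colB)), by = colA]

# Two spellings of the same simple user-defined aggregate.
spread1 <- DT[, .(spread = max(colB) - min(colB)), by = colA]
spread2 <- DT[, .(spread = diff(range(colB))), by = colA]

# A scaled mean as another trivial user-defined expression.
pct <- DT[, .(pct = 100 * mean(colB)), by = colA]

print(med)
print(spread1)
print(pct)
```

Both spellings of the spread should return one row per group, which makes them a fair like-for-like comparison of how each product handles a non-built-in expression.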

Items 1 and 2 are still single aggregates, though. Also common in data science is returning more than one value from each group, for example:

largest 2 values in colB by group: DT[, colB[head(order(-colB), 2)], by=colA]

and adding a column by group (scale springs to mind):

DT[, newCol := (colB - mean(colB)) / sd(colB), by=colA] vs DT[, newCol := scale(colB), by=colA]
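
The two multi-value cases can be sketched as follows on a toy table. The data and the names top2_sort, top2_order, manual and scaled are hypothetical, chosen only for this example. One assumption to flag: scale() returns a one-column matrix, so the sketch wraps it in as.vector() to store a plain numeric column; whether the benchmark should include that wrapper is an open choice.

```r
library(data.table)

# Hypothetical toy data: colA is the grouping key, colB the value column.
DT <- data.table(colA = c("a", "a", "a", "b", "b"),
                 colB = c(1, 5, 3, 10, 4))

# Largest 2 values of colB per group, written two ways.
# order(-colB) sorts descending, so head(..., 2) picks the two largest.
top2_order <- DT[, .(colB = colB[head(order(-colB), 2)]), by = colA]
top2_sort  <- DT[, .(colB = head(sort(colB, decreasing = TRUE), 2)), by = colA]

# Add a standardised column by group, by hand and via scale().
DT[, manual := (colB - mean(colB)) / sd(colB), by = colA]
DT[, scaled := as.vector(scale(colB)), by = colA]  # as.vector() drops the matrix dim

print(top2_order)
print(DT)
```

The hand-written and scale() versions should agree up to floating-point tolerance, so any timing difference reflects the cost of the function call rather than a different result.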

All of these new tests would need to be added for each product.
