h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Expand grouping beyond sum and mean #59

Closed mattdowle closed 5 years ago

mattdowle commented 5 years ago

Only sum and mean are benchmarked currently, albeit on a reasonable range of cardinalities and data types. Plenty of additions quickly spring to mind, so it's more a question of balancing resources against potential insight. For starters, perhaps we can consider the following.

  1. Add median. Computing it is a different ball game from sum and mean, and it is also very common in data science.
  2. Add a very simple user defined function (see #57 too); e.g. DT[, max(colB) - min(colB), by=colA] vs DT[, diff(range(colB)), by=colA]. Or something like DT[, 100*mean(colB), by=colA].
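To make the two proposals concrete, here is a minimal sketch on a toy table. The data and the result column names (med, spread, m100) are made up for illustration and are not part of the benchmark. Note that range() returns both the minimum and the maximum, so the sketch wraps it in diff() to get a single value per group that matches max(colB) - min(colB).

```r
library(data.table)

# Hypothetical toy data: colA is the grouping key, colB the measure.
DT <- data.table(colA = c("a", "a", "a", "b", "b"),
                 colB = c(1, 5, 3, 10, 4))

# Proposal 1: median per group.
med <- DT[, .(med = median(colB)), by = colA]

# Proposal 2: simple user defined aggregates.
# The same spread written two ways, plus a scaled mean.
spread1 <- DT[, .(spread = max(colB) - min(colB)), by = colA]
spread2 <- DT[, .(spread = diff(range(colB))), by = colA]
scaled  <- DT[, .(m100 = 100 * mean(colB)), by = colA]

print(med)
print(spread1)
print(scaled)
```

Both spread variants return one row per group, so they stay single aggregates; the difference a benchmark would measure is how well each product optimises the two spellings.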

1 and 2 are still single aggregates though. Also common in data science is returning more than 1 value from each group, as follows.

  1. Largest 2 values in colB by group: DT[, colB[head(order(-colB), 2)], by=colA]
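A minimal sketch of the "largest 2 per group" query, on made-up data. It assumes a descending sort is intended, since order() sorts ascending by default and head() of an ascending order would pick the two smallest values instead.

```r
library(data.table)

# Hypothetical toy data: colA is the grouping key, colB the measure.
DT <- data.table(colA = c("a", "a", "a", "b", "b"),
                 colB = c(1, 5, 3, 10, 4))

# order(-colB) gives descending positions within each group,
# so head(..., 2) selects the two largest values.
top2 <- DT[, .(colB = colB[head(order(-colB), 2)]), by = colA]

# An equivalent spelling that sorts the values directly.
top2_alt <- DT[, .(colB = head(sort(colB, decreasing = TRUE), 2)), by = colA]

print(top2)
```

Unlike sum, mean or median, this returns up to two rows per group, which is the property the issue wants to exercise.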

and adding a column by group (scale springs to mind):

  1. DT[, newCol := (colB - mean(colB)) / sd(colB), by=colA] vs DT[, newCol := scale(colB), by=colA]
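A minimal sketch of adding a standardised column by group, on made-up data. The column names z1 and z2 are hypothetical. Because scale() returns a one-column matrix, this sketch wraps it in as.vector() so that a plain numeric column is stored; that wrapper is an assumption of the sketch, not something the issue specifies.

```r
library(data.table)

# Hypothetical toy data; each group has at least two rows so sd() is defined.
DT <- data.table(colA = c("a", "a", "a", "b", "b"),
                 colB = c(1, 5, 3, 10, 4))

# Hand-written standardisation within each group.
DT[, z1 := (colB - mean(colB)) / sd(colB), by = colA]

# The same via scale(), flattened from a one-column matrix to a vector.
DT[, z2 := as.vector(scale(colB)), by = colA]

print(DT)
```

Both columns should agree up to floating-point tolerance, since scale() centres on the mean and divides by the sample standard deviation by default.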

All of these new tests would need to be added for each product.

jangorecki commented 5 years ago

closing as duplicate of #60