h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

benchmark user defined functions (UDF) #57

Closed jangorecki closed 5 years ago

jangorecki commented 5 years ago

When using common functions like sum or mean, most languages redirect processing to fast C implementations. User defined functions in such tools would be significantly slower. Some languages, for example Julia, compile all the code they run, so user defined functions can be fast there as well. Therefore we should have a test that stresses user defined functions. The important thing is to design such a function properly; feedback is welcome.

bkamins commented 5 years ago

First of all: I would say that it is OK to focus tests on sum and mean as they should be optimized and they are most commonly used in practice.

The design of an appropriate additional test is tricky because:

Having said that, a natural and reasonably simple thing to benchmark is the calculation of R^2 of a linear regression based on two columns, i.e. the steps would be:

  1. take columns X and Y
  2. estimate the regression Y = a+bX in the most efficient way available
  3. calculate R^2 based on a, b, X and Y

and I would allow whatever is reasonable in a given language, assuming that we use the same algorithm everywhere.

The benefit of this test is that we check how the framework handles using two columns for data aggregation (sum and mean test only one column).
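The three steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names `r_squared` and `grouped_r_squared` are hypothetical, the fit uses the closed-form least-squares formulas, and nothing here is taken from the benchmark's actual code.

```python
from collections import defaultdict


def r_squared(x, y):
    """R^2 of the least-squares fit y = a + b*x (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Step 2: closed-form estimates of slope b and intercept a.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Step 3: R^2 from residual and total sums of squares.
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def grouped_r_squared(keys, x, y):
    """Apply r_squared per group, mimicking a two-column aggregation."""
    groups = defaultdict(lambda: ([], []))
    for k, xi, yi in zip(keys, x, y):
        groups[k][0].append(xi)
        groups[k][1].append(yi)
    return {k: r_squared(gx, gy) for k, (gx, gy) in groups.items()}
```

For example, `grouped_r_squared(["a", "a", "a", "b", "b", "b"], [1, 2, 3, 1, 2, 3], [1, 3, 2, 3, 5, 7])` computes one R^2 per group from both columns at once, which is the property the test is meant to exercise.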

jangorecki commented 5 years ago

Thanks for the input. Note that we are not benchmarking languages but packages. Rcpp has to be excluded, as it ships only an interface, not functions that solve the problem. It would only be applicable for writing a UDF, so only for a single question. I am not sure about regression: it is not really a user defined function, since users generally use libraries for regression. It also stresses the language heavily. What about:

relative_even_ratio = function(x, y) sum(!x %% 2) / sum(!y %% 2)

count_greater = function(x, y, cheat.x=1, cheat.y=-1) sum((x+cheat.x) > (y+cheat.y))
# which could be called using `cheat.x=id4%%2` where id4 is int grouping variable
# so function would have different `cheat.x` value for different groups
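For comparison with non-R tools, the two R functions above could be translated roughly as below. This is a hedged sketch, not part of the benchmark: the Python names mirror the R ones, the offsets are assumed to be scalars (in R, `cheat.x` may also be a vector), and `relative_even_ratio` raises `ZeroDivisionError` when `y` has no even values, whereas R would return `Inf` or `NaN`.

```python
def relative_even_ratio(x, y):
    """Count of even values in x divided by count of even values in y."""
    even_x = sum(1 for v in x if v % 2 == 0)
    even_y = sum(1 for v in y if v % 2 == 0)
    return even_x / even_y


def count_greater(x, y, cheat_x=1, cheat_y=-1):
    """Count positions where x + cheat_x exceeds y + cheat_y.

    cheat_x and cheat_y are assumed to be scalar offsets in this sketch.
    """
    return sum(1 for xi, yi in zip(x, y) if xi + cheat_x > yi + cheat_y)
```

Both functions iterate element by element in interpreted code, which is exactly the kind of work a UDF test would expose.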

bkamins commented 5 years ago

I proposed a regression because it is essentially a calculation of variance and correlation, which should be available in any language out of the box. I agree that users normally calculate it using a package, but the point was to calculate something like:

cov(x,y)^2 / (var(x) * var(y))

(we could take any other expression - I just proposed something that is already defined)
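As a rough illustration, the expression `cov(x,y)^2 / (var(x) * var(y))` can be written as a user defined function in plain Python. The name `squared_correlation` is hypothetical, and the sketch assumes sample (n - 1) denominators as in R's `cov` and `var`; those denominators cancel in the ratio, so the choice does not affect the result.

```python
def squared_correlation(x, y):
    """cov(x, y)^2 / (var(x) * var(y)), using sample (n - 1) denominators."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
    var_y = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)
    return cov_xy ** 2 / (var_x * var_y)
```

For a simple linear regression with an intercept, this value coincides with R^2, so no explicit model fit is needed.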

bkamins commented 5 years ago

What you propose is also nice.

jangorecki commented 5 years ago

Closing as a duplicate of #60; feedback is welcome there. I discarded my example from here because it was a purely artificial one.