devel/stable releases of solutions

jangorecki commented 4 years ago

Current state is as follows:

automatically upgraded to recent devel: data.table, pydatatable, ~~dplyr~~
manually upgraded to recent stable: cudf, clickhouse
automatically upgraded to recent stable: pandas, dask, spark, juliadf, dplyr

Ideally we would have all solutions upgraded to recent devel automatically. The problem is that most of them do not support devel upgrades in a straightforward manner.

I noticed that recent dplyr is much slower than it used to be. I am not 100% sure whether this is caused by dplyr development. Timings below are for the 1e7 rows data (0.5GB), comparing the previous run on 44cc2c1 (31/10) with the recent one on a6658f2 (19/11).

                question 44cc2c1 a6658f2
1:         sum v1 by id1   0.240   1.513
2:     sum v1 by id1:id2   0.486   4.878
3: sum v1 mean v3 by id3   0.770  28.824
4:     mean v1:v3 by id4   0.601   2.145
5:      sum v1:v3 by id6   1.471  47.612
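
To make the regression easier to read, the slowdown per question can be computed directly from the two columns above. This is only a minimal sketch in Python: the numbers are copied from the table, while the names `timings`, `slowdown` and `ratios` are made up for illustration and are not part of the benchmark code.

```python
# Timings copied from the table above: (run on 44cc2c1, run on a6658f2).
timings = {
    "sum v1 by id1": (0.240, 1.513),
    "sum v1 by id1:id2": (0.486, 4.878),
    "sum v1 mean v3 by id3": (0.770, 28.824),
    "mean v1:v3 by id4": (0.601, 2.145),
    "sum v1:v3 by id6": (1.471, 47.612),
}


def slowdown(before, after):
    """Return how many times slower the newer run is than the older one."""
    return after / before


# Hypothetical helper structure: slowdown factor per question.
ratios = {q: slowdown(before, after) for q, (before, after) in timings.items()}

# Print the questions from most to least affected.
for question, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{question:<22} {ratio:6.1f}x")
```

Sorting by the ratio rather than by absolute time puts the most affected questions first, regardless of how long they took originally.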

This raises the question of whether we really want to present timings from the recent development version of a tool. The upside is that developers of those tools get better insight into the performance of their development version. On the other hand, they may already be aware of some inefficiencies that they are going to address before the stable release. Presenting inefficiencies in a development version might look unfair, so the person to decide about that should be the author of the tool. @hadley should we switch to using stable dplyr releases in the benchmarks?

hadley commented 4 years ago

I don't really care; we are in the middle of a major rewrite of dplyr so performance benchmarks are not interesting to us at the moment.

jangorecki commented 4 years ago

I switched dplyr to use the latest stable release.

h2oai / db-benchmark

devel/stable releases of solutions #124