Current state is as follows:
- automatically upgraded to recent devel: data.table, pydatatable, dplyr
- manually upgraded to recent stable: cudf, clickhouse
- automatically upgraded to recent stable: pandas, dask, spark, juliadf, dplyr
Ideally, we wanted all solutions to be upgraded to the recent devel version automatically. The problem is that most of them do not support devel upgrades in a straightforward manner.
What I noticed is that recent dplyr is much slower than it was a short while ago. I am not 100% sure whether this is caused by dplyr development. Below are timings for 1e7 rows of data (0.5GB): the previous run on 44cc2c1 (31/10) and the recent one on a6658f2 (19/11).
| question | 44cc2c1 | a6658f2 |
|---|---|---|
| 1: sum v1 by id1 | 0.240 | 1.513 |
| 2: sum v1 by id1:id2 | 0.486 | 4.878 |
| 3: sum v1 mean v3 by id3 | 0.770 | 28.824 |
| 4: mean v1:v3 by id4 | 0.601 | 2.145 |
| 5: sum v1:v3 by id6 | 1.471 | 47.612 |
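To put a number on the regression, here is a minimal sketch that divides the a6658f2 timing by the 44cc2c1 timing for each question. The figures are copied from the timings above. The `timings` dictionary and the `slowdown_factors` helper are names I made up for illustration and are not part of the benchmark scripts. The only assumption is that both runs report timings in the same unit.

```python
# Timings copied from the comparison above: (44cc2c1 run, a6658f2 run).
# Assumption: both columns use the same time unit.
timings = {
    "sum v1 by id1": (0.240, 1.513),
    "sum v1 by id1:id2": (0.486, 4.878),
    "sum v1 mean v3 by id3": (0.770, 28.824),
    "mean v1:v3 by id4": (0.601, 2.145),
    "sum v1:v3 by id6": (1.471, 47.612),
}


def slowdown_factors(timings):
    """Return {question: recent / previous} (hypothetical helper)."""
    return {
        question: recent / previous
        for question, (previous, recent) in timings.items()
    }


factors = slowdown_factors(timings)
for question, factor in factors.items():
    print(f"{question}: {factor:.1f}x slower")
```

A ratio is easier to compare across questions than the raw difference, because the questions have very different baseline timings.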
This raises the question of whether we really want to present timings that come from the recent development version of a tool. The good part is that developers of those tools get better insight into the performance of their development version. On the other hand, they may already be aware of some inefficiencies that they are going to address before the stable release. Presenting inefficiencies in a development version might look unfair, so the person to decide about that should be the author of the tool. @hadley should we switch to using stable dplyr releases in benchmarks?