Open malexan opened 10 years ago
I am not sure which function in plyr you are referring to; maybe you mean the foreach package?
Hadley's packages are great tools for beginners, as they are simple and generate consistent output. However, the drawback is that they are generally inefficient.
Yes, about foreach: any **ply function from the plyr package can run in parallel with foreach as the backend.
I don't know about inefficiency in plyr, because I haven't compared it with anything. But some components of plyr are written in C.
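For readers who have not seen this, here is a minimal sketch of a **ply call dispatched through foreach. It assumes the doParallel package is installed as the backend (any registered foreach backend should work); the worker count and the toy function `slow_square` are made up for illustration and are not from this thread.

```r
library(plyr)
library(doParallel)  # assumption: doParallel is the foreach backend in use

# Register a foreach backend with 2 workers (illustrative number).
registerDoParallel(cores = 2)

# Hypothetical toy task: pretend each piece takes a while to compute.
slow_square <- function(i) {
  Sys.sleep(0.1)
  i^2
}

# .parallel = TRUE asks plyr to run the pieces through the registered backend.
res <- laply(1:8, slow_square, .parallel = TRUE)

# Release the implicit cluster, if one was created (e.g. on Windows).
stopImplicitCluster()
```

Without a registered backend, plyr falls back to sequential execution, so the same code can be tried with and without `registerDoParallel()`.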
Yes, unfortunately there is some work that can't be parallelized, and for loops are essential. One example is an implementation of the EM algorithm, which cannot be split apart since it's sequential. We are looking into writing some of this using Rcpp.
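To illustrate why such a loop cannot be split apart, here is a minimal base R sketch of EM for a two-component Gaussian mixture. It is a toy written for this comment, not the implementation discussed above; the data, starting values, and iteration count are all assumptions.

```r
# Synthetic data: a 50/50 mixture of N(0, 1) and N(5, 1), for illustration only.
set.seed(1)
x <- c(rnorm(200, mean = 0), rnorm(200, mean = 5))

# Arbitrary starting values: mixing weight, means, standard deviations.
theta <- list(p = 0.5, mu = c(-1, 1), sigma = c(1, 1))

for (iter in 1:50) {
  # E-step: responsibility of component 1, computed from the CURRENT theta.
  d1 <- theta$p * dnorm(x, theta$mu[1], theta$sigma[1])
  d2 <- (1 - theta$p) * dnorm(x, theta$mu[2], theta$sigma[2])
  r <- d1 / (d1 + d2)

  # M-step: re-estimate the parameters from the responsibilities.
  mu <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
  sigma <- c(sqrt(sum(r * (x - mu[1])^2) / sum(r)),
             sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r)))

  # The next iteration needs this theta, so iterations cannot run side by side.
  theta <- list(p = mean(r), mu = mu, sigma = sigma)
}

theta
```

The work inside each iteration is already vectorized over the observations; only the outer loop is inherently sequential, which is the part a compiled rewrite would target.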
Here is a toy benchmark I did a while ago, in which data.table is around 4 times as fast as ddply. There are other benchmarks around the web, and one of Hadley's goals is to make **ply as fast as data.table.
I would still, however, train the staff with **ply and data.frame first.
########################################################################
########################################################################
library(data.table)
library(plyr)
library(FAOSTAT)
library(rbenchmark)  # provides benchmark()

# Download the data and build a keyed data.table copy of it
test.df = getFAO(name = "Value")
test.dt = data.table(test.df)
setkey(test.dt, FAOST_CODE, Year)

# Sum of Value by Year, four ways
dtSum = function(dt){
    dt[, j = sum(Value), by = "Year"]
}
aggSum = function(df){
    aggregate(df$Value, by = list(df$Year), FUN = sum)
}
plyrSum = function(df){
    ddply(.data = df, .variables = .(Year), .fun = function(x) sum(x$Value))
}
tapSum = function(df){
    tapply(df$Value, df$Year, sum)
}

# Compare timings, fastest first
benchmark(dtSum(test.dt), aggSum(test.df), plyrSum(test.df), tapSum(test.df),
          order = "relative", replications = 100)
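Since the benchmark depends on downloaded data, here is a sketch that swaps in a synthetic data frame so the four approaches can be tried offline and checked against each other. The column names mirror the ones used above; the sizes and values are invented for illustration and say nothing about the real FAOSTAT data.

```r
library(data.table)
library(plyr)

set.seed(42)
# Synthetic stand-in for the download: 100 codes x 50 years (made-up values).
test.df <- data.frame(
  FAOST_CODE = rep(1:100, each = 50),
  Year = rep(1961:2010, times = 100),
  Value = runif(5000)
)
test.dt <- data.table(test.df)

# Same four aggregations as in the benchmark, run once each.
dt_res  <- test.dt[, sum(Value), by = "Year"]                 # columns Year, V1
agg_res <- aggregate(test.df$Value, by = list(test.df$Year), FUN = sum)
ply_res <- ddply(test.df, .(Year), function(x) sum(x$Value))  # columns Year, V1
tap_res <- tapply(test.df$Value, test.df$Year, sum)           # named by Year

# Use the tapply result as the reference and compare per-year totals.
ref <- as.numeric(tap_res)
same_dt  <- isTRUE(all.equal(dt_res[order(Year)]$V1, ref))
same_agg <- isTRUE(all.equal(agg_res$x, ref))
same_ply <- isTRUE(all.equal(ply_res$V1, ref))
c(data.table = same_dt, aggregate = same_agg, ddply = same_ply)
```

Checking that the results agree before timing them guards against benchmarking functions that quietly compute different things.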
Also, data.table is extremely efficient in memory usage.
http://www.r-statistics.com/2013/09/a-speed-test-comparison-of-plyr-data-table-and-dplyr/
Here is a slightly more comprehensive review. I think it would be great if you can train them in **ply and Hadley's tools, since they are generally more elegant and consistent, and then move to other tools when efficiency is required.
Yes, we can take plyr, reshape2, stringr, lubridate, httr.
There is also the testthat package, which looks very promising for writing efficient code, but I don't have enough experience with it yet.
To achieve more elegant code, I would persuade users to use plyr's functions instead of 'for' loops.
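As a small illustration of that point, here is the same grouped sum written with a 'for' loop and with ddply. The data frame is a made-up toy and the column name `total` is just a label chosen for this sketch.

```r
library(plyr)

# Hypothetical toy data: four values per year.
df <- data.frame(Year = rep(2001:2003, each = 4), Value = 1:12)

# 'for' loop version: pre-allocate the result, then fill it in by index.
years <- unique(df$Year)
loop_res <- data.frame(Year = years, total = NA_real_)
for (i in seq_along(years)) {
  loop_res$total[i] <- sum(df$Value[df$Year == years[i]])
}

# plyr version: split by Year, apply the summary, combine, all in one call.
ply_res <- ddply(df, .(Year), summarise, total = sum(Value))
```

Both give totals of 10, 26 and 42 for 2001 to 2003; the plyr call states the grouping and the summary directly instead of managing indices by hand.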
Additional bonus from plyr: